INFORMATION EXCHANGE IN IABIN:
Technical Issues to be Considered in the Implementation
of the Inter-American Biodiversity Information Network
April 13, 1999
Prepared for the
Technical Meeting for the Establishment of IABIN
Brasilia, Brazil
EXECUTIVE SUMMARY
I. BACKGROUND
II. APPROACH
III. GOALS
IV. STRATEGY
V. FIRST STEPS
VI. PILOT PROJECT LEVELS OF PARTICIPATION
Country Representatives
Country Nodes
IABIN Partners
Funding Considerations
VII. DATA STRUCTURES
VIII. VOCABULARIES
IX. MAPPING SPECIES OCCURRENCES: A DISTRIBUTED DATABASE APPLICATION
X. Z39.50 PROFILES
XI. METADATA AND ELECTRONIC PUBLISHING
XII. TECHNICAL CAPABILITIES NEEDED FOR IABIN SITES
Web Site Technology
Connectivity
Security
Search and Query Technology
XIII. CONCLUSIONS
APPENDIX: List of Acronyms
EXECUTIVE SUMMARY
The Inter-American Biodiversity Information Network (IABIN) was mandated in the action plan arising from the Santa Cruz (Bolivia) Summit of the Americas on Sustainable Development. Central to the implementation of IABIN is an understanding of the technical challenges which will need to be addressed by the participants. This study, in conjunction with the development of plans for a pilot project on invasive species in the Americas, undertook to discover and define those challenges and offer recommendations to IABIN participants concerning information exchange in light of such challenges.
Any effective framework for networking biodiversity information in the Americas must address at least the following considerations:
The technical approaches to the implementation of IABIN should build on, and reflect, the approaches taken by other international biodiversity networking initiatives. The first step is to inventory available information, technology, and information needs. This should be done in a formal and organized fashion, developing consistent catalogs of experts, partner organizations, species and resources of concern to those experts and organizations, existing data sets, major existing scientific and management projects, and various kinds of resulting synthetic data, educational materials, and capacity-building opportunities. Because resource cataloging is an almost insurmountable task if attempted for all important biodiversity resources in the hemisphere, catalogs should first be developed for a small number of well-defined environmental issues of particularly high visibility and importance. These are the IABIN pilot projects.
Individual projects must select standards and protocols to be used,
including metadata strategies, data structures, controlled vocabularies,
electronic publishing (including peer review) processes, etc. Web site
technology, connectivity, security, and search and query technology are
other technical issues which must be considered. If carefully done, the
information systems developed for these pilot projects should be widely
applicable to other biodiversity issues considered by IABIN.
Summary of Specific Recommendations
Recommendation 1: IABIN should adopt the list of databases as
a minimum set for information sharing in the Americas on invasive vascular
plants, invasive fish, specialist pollinators, and amphibians, insofar
as funds and personnel are available.
Recommendation 2: Each country participating in each pilot should designate a primary country representative. Ideally, each country representative should both participate in or collaborate with the country's official policy toward the Biodiversity Convention and IABIN, and be affiliated with an organization with the interest, facilities, and computer expertise to host a moderately sophisticated database center and Web site.
Recommendation 3: IABIN should explore adapting the North American Biodiversity Information Network's (NABIN) "Species Analyst" approach and software to develop a distributed species mapping capability for species groups (invasive plant, invasive fish, declining amphibians, selected declining pollinators, hanta virus outbreaks), as funds allow, as a prototype for a more ambitious effort to link biodiversity databases using public-domain Z39.50 server technology.
Acknowledgments
This report was prepared by Dr. James F. Quinn, Department of Environmental
Science and Policy, University of California at Davis. Funding for this
study was provided by the United States Agency for International Development,
Project #598-0780, "Environmental Support Project," under an Interagency
Agreement with the U.S. Department of the Interior. Project management
was provided by the International Biological Informatics Program of U.S.
Geological Survey.
I. BACKGROUND
In December 1996, leaders of the governments of the Americas met at the Santa Cruz (Bolivia) Summit on Sustainable Development. Government leaders recognized the importance of reliable and accurate information on biodiversity in decision-making and the need for cooperation among the countries of the Western Hemisphere to link information sources together. Summit leaders agreed to:
Seek to establish an Inter-American Biodiversity Information Network, primarily through the Internet, that will promote compatible means of collection, communication and exchange of information relevant to decision-making and education on biodiversity conservation, and that builds upon such initiatives as the Clearing-House Mechanism provided for in the United Nations Convention on Biological Diversity, the Man and the Biosphere Network (MABNet), and the Biodiversity Conservation Information System (BCIS), an initiative of nine IUCN programs and partners.
This declaration, Initiative 31, prompted a series of informal meetings among interested parties, which were followed by two Experts' Meetings, sponsored by the Organization of American States, regarding the establishment of IABIN. At the Experts' Meeting in January 1998, the United States Geological Survey announced an inter-agency agreement with the U.S. Agency for International Development to support planning of the IABIN concept by identifying informational needs for sharing biodiversity information among IABIN partners, and by initiating several pilot projects. The purpose of the pilots is in part to demonstrate and test communications strategies for a wider range of biodiversity issues important to IABIN partners, and in part to begin the actual process of building a better infrastructure for exchanging scientific and management information on biodiversity in the hemisphere.
U.S. participants in IABIN met in October, 1998, in Alexandria, Virginia, to review available information and develop recommendations on goals, institutional frameworks, legal constraints, and information sharing for IABIN, and to develop priorities for pilot projects. As in earlier meetings, invasive species were identified as a priority for international information networking. Invasive species have huge ecological and economic impacts, have generated a substantial body of scientific knowledge which could be assembled into a useful information framework, and are of immediate concern to a number of present and potential IABIN cooperators. To keep an initial pilot project on invasive species manageable, the U.S. experts decided to concentrate on non-indigenous vascular plants and freshwater fish. Following similar logic, breakout groups in Alexandria also recommended pilot projects concerning amphibians and specialist pollinators, both of which have experienced marked declines on a vast geographic scale. Other expert groups have suggested other taxa or ecological guilds (e.g., corals vis-a-vis bleaching) as additional candidates for IABIN pilots.
Shortly after the Alexandria meeting, the U.S. Geological Survey, The Nature Conservancy, and the University of California sponsored a workshop at the National Center for Ecological Analysis and Synthesis at the University of California, Santa Barbara, to recommend strategies for both an invasive species pilot and an overview of information strategies to support that pilot and other related initiatives of IABIN. Participants and subsequent reviewers represent 10 countries and a range of governmental, non-profit, and academic organizations. The resulting recommendations for invasives are found in the report entitled, "Inter-American Biodiversity Information Network (IABIN): Invasive Species in the Americas Pilot Projects."
This report represents a parallel effort to review some approaches to
information strategies to support the proposed pilot projects. The particular
framework described was specifically designed around the information needed
to assess and manage effects of invasive species. However, the recommendations
should be broadly applicable to pollinators, amphibians, corals, or other
species-occurrence-based initiatives arising within IABIN.
II. APPROACH
All of the IABIN meetings to date have concluded that a systematic approach to discovering information sources and facilitating exchange is fundamental to establishing an inter-American information network on biodiversity. Within a number of countries, efforts have already resulted in the establishment of national information networks; various sub-regional network efforts are also proceeding, for example in Central America. In the Andean nations, a network is focusing on exchanging social and economic data. Extending national and sub-regional biodiversity efforts into a hemisphere-wide information network is a next step and reflects the approach -- to build on existing efforts -- directed in Initiative 31 of the Santa Cruz summit action plan.
Setting up an information network involves a wide range of challenges involving information technology, informatics, and information infrastructure. At the international level, an information network takes on additional complexities: distances, telephone connections, connectivity, among many others. Even in an ideal technical configuration, information challenges include issues such as how information is stored, discovered, filtered, and exchanged.
Many organizations in the Americas have already confronted some of the technical and information challenges specific to their own sites, and the IABIN network development efforts should build on these lessons learned. In particular, it is important that any international efforts arising out of IABIN build upon, and communicate with, the initiatives of the Clearing-House Mechanism (CHM) of the Convention on Biological Diversity (http://www.biodiv.org/chm/), the MABNet Americas program (http://www.mabnetamericas.org/mabnet/home.html), and the extensive national efforts to build biodiversity information systems in partner countries. Particularly active examples include the biodiversity database efforts of the Base de Dados Tropical (http://www.bdt.org.br/bdt/) in Brazil, INBio (http://www.inbio.ac.cr/) in Costa Rica, CONABIO (http://www.conabio.gob.mx/) in Mexico, and the National Biological Information Infrastructure (NBII) in the United States (http://www.nbii.gov), but there are important national resources for biodiversity information in every participating country.
There are also a number of complementary international and global efforts
to coordinate biodiversity information, especially on protected lands.
Notable examples include the Biodiversity Conservation Information System
(http://www.biodiversity.org/),
The Nature Conservancy's network of Conservation Data Centers (http://www.consci.tnc.org/src/bcdover.html),
and the Man and the Biosphere program's Biosphere Reserve Integrated Monitoring
initiative
(http://www.usmab.org/brim/home.html). While these programs
share many goals, they are not currently particularly interoperable in
their data structures, use of names and terminology, or software. A goal
of IABIN should certainly be to build upon the knowledge represented in
these efforts, and to enhance the conservation efforts they represent.
However, seamless integration of the information among major biodiversity
databases remains a very long term objective.
III. GOALS
Any effective framework for networking biodiversity information in the Western Hemisphere must address at least the following considerations:
IV. STRATEGY
An information network serves a number of distinct functions that can be organized into a number of tiers of complexity:
The previous levels are intended to permit a user to identify a
useful body of information (typically a database), and to understand its
source, structure and applicability for particular tasks well enough to
make use of it for purposes not foreseen by the data developer. The ultimate
goal, of course, is to have some kinds of comparative data available from
multiple sources covering large geographic areas. Sharing data has traditionally
been done by having one site collect data from multiple contributors and
compile a single summary database. While there are numerous worthy biodiversity
compendia of this kind in the Americas, all suffer from infrequent updates,
a lack of the local knowledge needed to interpret particular records, and
limited scope. The principal goal of IABIN is to provide access to the
most current records from a wider range of experts - in other words, a
fully distributed database.
In practice, neither software, nor network capacity, nor consistency in methods now permit a fully distributed biodiversity database (live links to searchable digital collection labels for all American fish collections, for example). This kind of true interoperability is a long term goal, and requires both improved technology and substantial funding to standardize and digitize data. In the meantime, it is reasonable to collect abstracted data from multiple sources. For example, IABIN participants might want to share the simplest core elements of species occurrence records (date, location, species, collector, repository), without communicating important ancillary information (morphology, genetics, diet, chemicals in tissues, etc.) which may be special to an individual study. This would permit the construction of range maps and rates of range expansion, even if the collected data were inadequate to analyze the effects of diet or pollution on a particular species.
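The abstraction of core occurrence elements described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of any IABIN specification: the field names, the `CoreOccurrence` type, the helper functions, and the sample record are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Tuple

# Hypothetical field names for the core elements listed in the text
# (date, location, species, collector, repository).
CORE_FIELDS = ("date", "latitude", "longitude", "species", "collector", "repository")


@dataclass(frozen=True)
class CoreOccurrence:
    date: str          # ISO 8601 date string
    latitude: float    # decimal degrees, north positive
    longitude: float   # decimal degrees, east positive
    species: str       # Latin binomial
    collector: str
    repository: str


def abstract_record(full_record: Dict[str, Any]) -> CoreOccurrence:
    """Keep only the shared core elements; ancillary fields stay with the source."""
    missing = [name for name in CORE_FIELDS if name not in full_record]
    if missing:
        raise ValueError(f"record lacks core fields: {missing}")
    return CoreOccurrence(**{name: full_record[name] for name in CORE_FIELDS})


def latitude_extent(records: Iterable[CoreOccurrence]) -> Tuple[float, float]:
    """Crude range summary: southernmost and northernmost recorded latitude."""
    latitudes = [record.latitude for record in records]
    return min(latitudes), max(latitudes)


# Invented sample record; "diet" stands in for study-specific ancillary data.
sample = {
    "date": "1997-06-14", "latitude": -15.8, "longitude": -47.9,
    "species": "Cyprinus carpio", "collector": "A. Collector",
    "repository": "Example Museum", "diet": "omnivore",
}
core = abstract_record(sample)
```

Comparing `latitude_extent` over records grouped by year would give a first, rough indication of range expansion, while the ancillary fields remain available only from the original data holder.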
As a result, the final two tiers of data are
V. FIRST STEPS
Given the current explosion of information on the topics of the proposed pilots, it is important that the strategy be both incremental and designed to produce useful results early in the project.
A majority of the proposed pilot projects (invasive plants, invasive fish, pollinators, amphibians, bleaching corals, hanta-virus outbreaks) share common features that justify a coordinated approach to their data systems. The October, 1998, Santa Barbara workgroup of experts on invasive species suggested eight data themes for better tracking effects of invasive species in the Americas. All are intended to be on-line and searchable, and most are designed to use the G8 Global Information Locator Service (GILS) format. The themes are:
Recommendation 1: IABIN should adopt the list of databases as a minimum set for information sharing in the Americas on invasive vascular plants, invasive fish, specialist pollinators, and amphibians, insofar as funds and personnel are available.
Some of the databases (experts, species of concern) can be compiled in a straightforward fashion through questionnaires, digital templates (e.g., stand-alone Access data entry forms), or Web-based data entry. Most of the others require coordination at the level of a country or major institution. The invasives working group recommended that one or more country representatives be established for each country participating in each pilot. The representative(s) would be responsible for designating, and establishing if necessary, a main country Internet node for each pilot. Suggested institutional and technical capabilities and budgetary requirements for each node are discussed later, but in most countries the country representative and node should probably be associated with the government agency responsible for IABIN. Numerous secondary nodes -- for example, at universities, museums, and non-governmental organization (NGO) offices -- can be established. The working group encouraged country representatives to take advantage of existing digital data on biodiversity held by researchers, museums, Conservation Data Centers, parks and refuges, Biosphere Reserves, and other conservation-oriented organizations.
Recommendation 2: Each country participating in each pilot
should designate a primary country representative. Ideally, each country
representative should both participate in or collaborate with the country's
official policy toward the Biodiversity Convention and IABIN, and be affiliated
with an organization with the interest, facilities, and computer expertise
to host a moderately sophisticated database center and Web site.
VI. PILOT PROJECT LEVELS OF PARTICIPATION
As a first step toward the development of a hemisphere-wide information system on invasive plants and fishes, participants in the Santa Barbara working group recommended a pilot project with several levels of participation in each country:
Country Nodes
A country node which meets minimum Information Technology standards
should be designated or established. These minimum IT standards include:
IABIN Partners
A number of organizations or sites within a country might be designated
as IABIN Partners. Minimum requirements for participation in the pilot
should probably include:
Because the purpose of the proposed pilots is to demonstrate exchange of digital information on focused biodiversity problems, partners would be expected to have substantial information available to contribute in digital form. However, it is clear that many potentially valuable partners (from a scientific perspective) still have limited computer or telecommunications capabilities. Consequently, partners would be expected to have a data entry computer and standard office software, and to have e-mail (reliable enough that a one day maximum turnaround could be expected of the system administrator). Large data sets, however, might be transmitted on disk by mail rather than over the Internet. (This situation will change; with funding, satellite links to even the most remote areas may be expected in the near future.)
Funding Considerations
Funding needs depend upon the identity of the participant countries
and organizations in the pilots. At a bare minimum, the national nodes
probably need both a system administrator and at least one full time data
expert to work on particular pilots (doing data management, programming,
web site construction, etc.). Salary and expenses for the national representative
are needed unless covered from another source. Hardware, software, and
maintenance can be expected to cost some $5000 per year for each computer
in use, with as much more for communications costs. Substantial training
would probably be needed in the first year, suggesting that the full costs
of a national node could hardly be less than $50,000 per year, and could
easily be several times that amount. Partners' minimum communications and
staff costs can be much lower. However, at many partner sites, creating
or converting to digital data is a limiting factor to information exchange.
Taxonomic specimen records can typically be digitized at rates of only
a few specimens per hour, and ongoing field collecting is always underfunded.
Consequently, information systems and data management may be a small part
of the cost of linking to data from IABIN partners. However, without specific
participants, costs are difficult to estimate. [Without some needed information,
the Santa Barbara workshop estimated a minimum of about $500,000 for 2
years to construct a 6-country demonstration for the 8 databases recommended
for freshwater fish and vascular plants.]
VII. DATA STRUCTURES
There is a general consensus that IABIN should follow the lead of the CHM, BCIS, the U.S. National Biological Information Infrastructure (NBII), and many other biodiversity data centers, and begin by systematically cataloging available data on the pilot themes. More specifically, the experts' meetings to date have recommended an incremental approach, beginning with simple registries of experts, species, data sets, and management or restoration projects, all built on the common framework of the Global Information Locator Service (GILS) specification (or equivalent catalog standards).
GILS provides the most basic elements needed to catalog a database or other "data object" consisting of environmental information. Its purpose is not to document the data fully, but rather to describe it in a way that helps users discover it and determine whether it is appropriate for their use. The description contains information on how to acquire the data or contact the data holder.
GILS is covered by agreements among the G8 nations and executive orders in the United States, and has become fairly standard within the environmental community (see http://www.gils.org and http://ceres.ca.gov/catalog for examples). GILS is also a profile of Z39.50, a specification for data description that also encompasses the MARC bibliographic database specification (used by the U.S. Library of Congress and many other libraries) and some detailed metadata structures, including the widely used Federal Geographic Data Committee (FGDC) metadata standard, which is required for geospatial data (maps and images) produced by any U.S. government program. As a result, server software exists that permits straightforward distributed access to GILS and other Z39.50 datasets maintained on multiple servers, which simplifies data discovery in a spatially fragmented organization.
GILS may be thought of as a series of structured tags labeling the information (or a pointer to where the information resides) covered by the catalog. Tag types include title, lead organization, contact person, geographical application, subjects of the information, time periods covered, and access instructions. Each of these elements may be used to specify particular attributes important to the project. For example, in all IABIN projects, one required subject designation might be the taxa covered (or the habitat types, or land use types). For the pilot project database types recommended in Section V,
1) A registry of species under management or of priority concern to partners,
2) Experts,
3) Alerts (new locations or outbreaks),
4) A GILS metadata registry of pilot project datasets,
5) A searchable GILS-like registry of funded management projects,
6) A distributed on-line mapping system,
7) A catalog of needs and opportunities for capacity building, and
8) A web-based compendium of available educational materials related to the pilot,
most can be thought of as GILS-like. The first two (species and people)
contain a subset of GILS. The distributed mapping system is more ambitious,
though a prototype using Z39.50 has been successfully developed by NABIN
and the University of Kansas. Information catalogs on invasive species
that are specifically GILS-compatible have been deployed for invasive species
alerts (Item 3 -- http://www.nfrcg.gov), and invasive plant management
projects (Item 5 --
http://endeavor.des.ucdavis.edu/weeds/), as
well as numerous databases and educational offerings (Items 7 and 8 --
see http://www.gils.org for examples).
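To make the idea of a GILS-like record concrete, the following Python sketch models a catalog entry using the tag types named above. It is a minimal illustration only: the field names are paraphrases of those tag types rather than official GILS element names, and the `LocatorRecord` type, the `record_type` values, the `search` helper, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocatorRecord:
    """GILS-like catalog entry; fields paraphrase the tag types named in the text."""
    title: str
    lead_organization: str
    contact_person: str
    geographic_coverage: List[str]
    subjects: List[str]            # e.g. taxa, habitat types, land use types
    time_period: str
    access_instructions: str       # how to acquire the data or reach the holder
    record_type: str = "dataset"   # hypothetical: "species", "expert", "alert", ...


def search(catalog: List[LocatorRecord], subject: str) -> List[LocatorRecord]:
    """Return entries whose subject list contains the term (case-insensitive)."""
    wanted = subject.casefold()
    return [entry for entry in catalog
            if any(s.casefold() == wanted for s in entry.subjects)]


# Invented sample entries, for illustration only.
catalog = [
    LocatorRecord(
        title="Example invasive fish survey",
        lead_organization="Example Institute",
        contact_person="A. Contact",
        geographic_coverage=["Example Basin"],
        subjects=["Cyprinus carpio", "freshwater fish"],
        time_period="1990-1998",
        access_instructions="Write to the contact person for a copy on disk.",
    ),
    LocatorRecord(
        title="Example weed management project",
        lead_organization="Example Agency",
        contact_person="B. Contact",
        geographic_coverage=["Example Province"],
        subjects=["vascular plants"],
        time_period="1995-1999",
        access_instructions="Available on request.",
        record_type="project",
    ),
]

fish_entries = search(catalog, "Freshwater Fish")
```

Because every registry shares the same small set of tags, a single search routine can serve the species, expert, alert, dataset, and project catalogs alike; only the `record_type` and the vocabularies used in `subjects` differ.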
VIII. VOCABULARIES
GILS records or any other data integration strategy can only integrate data across multiple sources if entries denoting the same kind of object (e.g., the same species or habitat type) use the same name or code. Permitting data collectors to enter data without restrictions on their language leads to confusion. Consequently, it is important that the fields used to organize and search the data (species name, country, organization, etc.) only allow a limited list of terms selected from a standardized list (a "pick list," or "controlled vocabulary," or "thesaurus").
However, past experience has shown that imposing standardized names even for fairly standardized categories (such as species) is difficult and can essentially be impossible for more subjective classifications (such as habitats). In practice, vocabulary choices (or "thesauri") can only be successfully standardized among a particular professional community that chooses to adopt voluntarily a particular usage. However, it must be recognized that a particular use adopted by one community is likely to be unacceptable or of no use to professionals in another discipline. Consequently, it is important to designate not only the data entry (for example, species) but also the list from which it was taken (i.e., the taxonomic authority or reference).
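One way to record both the entry and the list it was taken from is to store every keyword as a thesaurus/term pair and to validate it against the named pick list. The Python sketch below assumes hypothetical thesaurus names and deliberately tiny pick lists; a real project would load the lists from whatever authorities its working group selects.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Keyword:
    thesaurus: str   # name of the list the term was taken from
    term: str        # the term itself


# Hypothetical pick lists, kept tiny for illustration.
PICK_LISTS: Dict[str, FrozenSet[str]] = {
    "example-fish-list": frozenset({"Cyprinus carpio", "Oreochromis niloticus"}),
    "example-country-list": frozenset({"Brazil", "Costa Rica", "Mexico"}),
}


def make_keyword(thesaurus: str, term: str) -> Keyword:
    """Accept a term only if it appears in the named controlled vocabulary."""
    if thesaurus not in PICK_LISTS:
        raise KeyError(f"unknown thesaurus: {thesaurus}")
    if term not in PICK_LISTS[thesaurus]:
        raise ValueError(f"{term!r} is not in {thesaurus}")
    return Keyword(thesaurus, term)


carp = make_keyword("example-fish-list", "Cyprinus carpio")
```

Keeping the thesaurus name alongside the term means that two communities can use the same word differently without their records being confused in a search.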
GILS is very flexible about the thesauri used. It has specific places for geographical, methodological, and subject keywords, but the specific choices are not specified. For an IABIN pilot, the challenge is to agree on some keyword sets (locations, species, habitats, condition, etc.) that will adequately characterize the shared data for the professional community. For a given pilot, such as invasive fish, a standardized choice of species list (e.g., Eschmeyer, 1998) is achievable, but standardized choices for river names, collecting methods, study purposes, etc. would still have to be decided. It is unlikely that the choices (at least initially) could be effectively shared with other pilots (such as bleaching corals). Consequently, each pilot will need a working group or existing authority to choose thesauri to be used, and to provide updates and interpretations as needed. The thesauri used should always be recorded, named, and made available on-line. This task is further complicated, of course, in a multilingual environment, where links between terms in different languages must be maintained (presumably by reference to a common code).
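The multilingual case can be handled by attaching language-specific labels to a shared code, as suggested above. The following Python sketch is an assumption-laden illustration: the codes (`HAB-001`, `HAB-002`), the label table, and the helper functions are invented and do not come from any existing thesaurus.

```python
from typing import Dict, Optional

# Hypothetical common codes, each carrying labels in several languages.
LABELS: Dict[str, Dict[str, str]] = {
    "HAB-001": {"en": "river", "es": "río", "pt": "rio"},
    "HAB-002": {"en": "lake", "es": "lago", "pt": "lago"},
}


def code_for(term: str, lang: str) -> Optional[str]:
    """Find the common code for a term as written in one language."""
    for code, labels in LABELS.items():
        if labels.get(lang, "").casefold() == term.casefold():
            return code
    return None


def translate(term: str, source_lang: str, target_lang: str) -> Optional[str]:
    """Translate by way of the common code; None if the term or label is unknown."""
    code = code_for(term, source_lang)
    if code is None:
        return None
    return LABELS[code].get(target_lang)
```

Records would store the code rather than any one label, so a search entered in Spanish and a record catalogued in Portuguese resolve to the same entry.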
An additional complication is that a user may wish to recover a record using a different thesaurus than a scientist originally used to develop the record. For example, a fish biologist might classify a location as being in a particular river, but a ministry employee might want to classify location by country or province. While such "cross-walks" can be done, it is important to retain the information about the original classification (rivers) as opposed to the necessarily less exact inferences of the appropriate entry from another thesaurus (provinces) chosen for other purposes.
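A cross-walk that preserves the original classification might look like the Python sketch below. The river-to-province table, the thesaurus names, and the `Location` type are hypothetical; the point illustrated is only that inferred entries are added next to the original entry and flagged as inferences, never substituted for it.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical cross-walk table: a river may cross several provinces,
# so the inferred entries are necessarily less exact than the original.
RIVER_TO_PROVINCES: Dict[str, List[str]] = {
    "Example River": ["Province A", "Province B"],
    "Sample Creek": ["Province A"],
}


@dataclass(frozen=True)
class Location:
    thesaurus: str          # which vocabulary the term belongs to
    term: str
    inferred: bool = False  # True when derived through a cross-walk


def crosswalk_to_provinces(original: Location) -> List[Location]:
    """Return the original entry followed by any inferred province entries."""
    results = [original]
    for province in RIVER_TO_PROVINCES.get(original.term, []):
        results.append(Location("example-provinces", province, inferred=True))
    return results


entries = crosswalk_to_provinces(Location("example-rivers", "Example River"))
```

A search by province can then match the inferred entries, while the biologist's original river designation remains the authoritative record.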
IABIN probably should not try to impose particular thesauri or vocabularies on all users. For example, it would probably be counterproductive to insist that all field botanists throughout the Americas accept the FGDC vegetation classification recently proposed as a standard for the U.S. However, IABIN as a whole or IABIN working groups can catalog and document vocabularies in wide use, support maintenance functions, and help develop cross-walks. Over time, it is likely that data contributors will choose to specify the most widely searched thesaurus:keyword treatments in order to raise the visibility of their published data, whether or not they also included other treatments they might prefer, or which might be common in their own user community. Permitting living thesauri, rather than static database fields, to do the "heavy lifting" of defining categorical descriptions of data sets (and other data objects) permits greater simplicity in the standard, flexibility to incorporate unanticipated data types, and perhaps easier adaptation to upcoming approaches (such as XML tools) to data warehousing and metasearching.
Once the vocabularies of "keywords" are established for particular pilots, populating most of the minimum database set will be straightforward. A number of candidate thesauri for environmental information exist, and these should be explored by individual projects before developing project-specific vocabularies. Examples of subject thesauri include the IUCN draft environmental thesaurus (T. Moritz, 1997 -- see http://www.biodiversity.org), the Global Change Master Directory (http://gcmd.gsfc.nasa.gov/), and the CERES Thesaurus project (http://ceres.ca.gov/thesaurus/), which derives environmental thesauri from Library of Congress keywords. Other kinds of (somewhat) standardized keywords include species names and chemical names; it would be desirable for IABIN projects to agree on particular master authorities for these types of names. For example, all fish names could be cross-referenced to Eschmeyer's Fish of the World, or the Integrated Taxonomic Information System adaptation of that work, and all chemicals could use the Chemical Abstracts Service code. Geographic keywords may be standardized on a coarse scale (countries, provinces, major rivers), but many geolocators (stream or village names) will probably have to be country-specific. Latitude-longitude designations are, of course, universal.
In summary, most of the databases recommended as core information for
the biological pilot projects represent GILS or GILS-like databases. These
are only easily searchable if they use specified vocabularies or thesauri.
The best choices for thesauri are not established (beyond, perhaps, for
species names), so they must be decided as an early activity of the country
representatives and other IABIN partners for each pilot. A working group
or organization charged with interpreting and updating the vocabularies
is also needed (and should presumably deliberate on-line rather than in
person). An initial workshop to make vocabulary choices and establish the
maintenance mechanism is probably needed for each pilot.
IX. MAPPING SPECIES OCCURRENCES: A DISTRIBUTED DATABASE APPLICATION
As just discussed, the first step in the incremental development of a distributed information system for the biodiversity information developed by IABIN pilot projects is metadata catalogs, following the GILS guidelines. This is exactly the same strategy being pursued by BCIS (http://www.biodiversity.org) for a broader overview of biodiversity information. As noted by IUCN (http://biodiversity.org/metadatabase.html), distributed queries from multiple sources require coordinated vocabularies -- one potential product of the cataloging efforts. However, where widespread vocabularies are already used, more ambitious mechanisms for data sharing are possible. In the pilot projects, point locations of species records provide such an opportunity, as Latin binomials and latitude-longitude locations can be reasonably interpreted from any source.
A system for interactive mapping of species locations and species ranges has been prototyped for museum records of North American birds by NABIN using the "Species Analyst" application developed by the University of Kansas with funding provided by the Commission for Environmental Cooperation (CEC, part of the NAFTA process). This pilot study provides an integrated viewer for the locations of bird specimens in the collections of several dozen museums in Mexico, the U.S. and Canada. This is done using a public-domain Z39.50 server (YAZ) and an application that, in essence, helps the curator of the digital collection database construct a table relating the format of the species and lat-long (plus collector and date) to the public (Z39.50) format posted by each server. This approach is independent of both the hardware running the YAZ server and the database software and structure being queried (as long as a Latin binomial, a lat-long, a date, and a collector are found in each record). Free client software on any machine connected to the Internet can then display bird records for many sites, filtered for species, date, collection, etc. The NABIN project has also successfully used these data reports and species distribution models, originally developed in Australia (the GARP and BIOCLIM projects), to predict species ranges in areas where biological surveys are missing or incomplete.
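The role of the curator's mapping table can be illustrated with a small Python sketch. This is not the Species Analyst software and it does not use Z39.50 or YAZ at all; it is an in-memory stand-in, with hypothetical museum names, column names, and invented specimen rows, showing how a per-collection table lets differently structured databases answer the same query in one public layout.

```python
from typing import Any, Dict, List

# Hypothetical mapping tables, one per collection: public field -> local column.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "museum_a": {"species": "sci_name", "latitude": "lat", "longitude": "lon",
                 "date": "coll_date", "collector": "coll_by"},
    "museum_b": {"species": "taxon", "latitude": "y", "longitude": "x",
                 "date": "when", "collector": "who"},
}

# Invented rows standing in for two independently structured local databases.
SOURCES: Dict[str, List[Dict[str, Any]]] = {
    "museum_a": [
        {"sci_name": "Turdus migratorius", "lat": 38.5, "lon": -121.7,
         "coll_date": "1985-04-02", "coll_by": "Collector One", "shelf": "B12"},
    ],
    "museum_b": [
        {"taxon": "Turdus migratorius", "y": 19.4, "x": -99.1,
         "when": "1990-11-20", "who": "Collector Two"},
        {"taxon": "Passer domesticus", "y": 45.4, "x": -75.7,
         "when": "1992-05-05", "who": "Collector Three"},
    ],
}


def to_public(source: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one local row into the shared public layout."""
    record = {public: row[local] for public, local in MAPPINGS[source].items()}
    record["source"] = source  # remember which collection holds the specimen
    return record


def query(sources: Dict[str, List[Dict[str, Any]]], species: str) -> List[Dict[str, Any]]:
    """Ask every source for one species and merge the translated hits."""
    hits = []
    for name, rows in sources.items():
        for row in rows:
            record = to_public(name, row)
            if record["species"] == species:
                hits.append(record)
    return hits


robins = query(SOURCES, "Turdus migratorius")
```

Each collection keeps its own schema and only the mapping table needs to be maintained locally; the merged `robins` list contains just the public fields plus the source name, which is enough to plot point locations on a shared map.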
Recommendation 3: IABIN should explore adapting the NABIN
"Species Analyst" approach and software to develop a distributed species
mapping capability for species groups (invasive plants, invasive fish, declining
amphibians, selected declining pollinators, hantavirus outbreaks), as
funds allow, as a prototype for a more ambitious effort to link biodiversity
databases using public-domain Z39.50 server technology.
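The curator's mapping table described in this section can be illustrated with a short sketch. It is not the "Species Analyst" software and does not use Z39.50; it only shows, under stated assumptions, how a per-site table of field names lets differently structured collections be merged and filtered. The shared field names, the local column names, and the sample records are all hypothetical.

```python
from datetime import date

# Shared field names every node exposes (hypothetical names for illustration).
SHARED_FIELDS = ("binomial", "latitude", "longitude", "date", "collector")

# Each curator supplies a table relating shared names to local column names.
MUSEUM_A_MAP = {"binomial": "sci_name", "latitude": "lat_dd",
                "longitude": "lon_dd", "date": "coll_date",
                "collector": "coll_by"}
MUSEUM_B_MAP = {"binomial": "Taxon", "latitude": "Y", "longitude": "X",
                "date": "Fecha", "collector": "Colector"}


def to_shared(local_record, field_map):
    """Translate one local record into the shared format."""
    return {shared: local_record[field_map[shared]] for shared in SHARED_FIELDS}


def filter_records(records, binomial=None, start=None, end=None):
    """Keep records matching a species name and an optional date window."""
    kept = []
    for rec in records:
        if binomial and rec["binomial"].lower() != binomial.lower():
            continue
        if start and rec["date"] < start:
            continue
        if end and rec["date"] > end:
            continue
        kept.append(rec)
    return kept


# Invented sample records from two differently structured collections.
museum_a = [{"sci_name": "Turdus migratorius", "lat_dd": 38.97,
             "lon_dd": -95.24, "coll_date": date(1962, 5, 3),
             "coll_by": "A. Smith"}]
museum_b = [{"Taxon": "Turdus migratorius", "Y": 19.43, "X": -99.13,
             "Fecha": date(1987, 11, 20), "Colector": "B. Lopez"},
            {"Taxon": "Passer domesticus", "Y": 19.40, "X": -99.10,
             "Fecha": date(1990, 1, 15), "Colector": "B. Lopez"}]

merged = ([to_shared(r, MUSEUM_A_MAP) for r in museum_a]
          + [to_shared(r, MUSEUM_B_MAP) for r in museum_b])
robins = filter_records(merged, binomial="Turdus migratorius")
```

The point of the design is that only the small mapping table is site-specific; the translation and filtering code is the same for every collection.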
X. Z39.50 PROFILES
It should be noted that GILS, the NABIN species-occurrence databases, and the full FGDC Spatial Metadata Profile (the U.S. standard for full documentation of maps and images, soon to be considered by the International Organization for Standardization (ISO) as a defined standard) are all Z39.50 profiles, as is the bibliographic database standard (MARC) used by the U.S. Library of Congress and many other traditional and digital libraries. This family relationship among the important data types for networking biodiversity information suggests that the classes of databases recommended in Section IV could profitably be approached as a nested set of Z39.50 profiles, ranging from the most abstracted (a GILS catalog of databases) through highly detailed (full FGDC or IUCN metadata) or distributed (NABIN "Species Analyst"). As with the individual catalogs, full interoperability requires agreement on vocabularies (thesauri) as well as formats (database schema) and server/data transfer standards (Z39.50).
A review by the American Institute of Biological Sciences of biological metadata strategies for the NBII suggested a similarly tiered Z39.50 strategy for cataloging and documenting biological information developed by NBII participants and partners.
As GILS for locator records and MARC for bibliographic records are already well established as both international standards and Z39.50 profiles, it would be efficient for the IABIN pilot databases and networks to follow these standards. The FGDC formulation for full geospatial metadata (detailed documentation for remote-sensing images and maps) is likely to become an ISO standard, and a biological variant may be proposed soon. Both are also Z39.50 profiles, and could also profitably be adopted by IABIN (though full metadata should be developed by the creator of the data, not the network apparatus).
All of these standards are flexible and evolving, so detailed compliance
is certainly less important than collecting and disseminating information
with Z39.50 content standards in mind. Similarly, while Z39.50 public domain
(YAZ) or commercial (Blue Angel) server software provides powerful technologies
for distributing environmental data over the Internet, the particular server
technologies will undoubtedly be supplanted by more powerful and flexible
systems, such as the XML-based data warehousing systems now under development
by most major software vendors (IBM, Oracle, Microsoft, etc.). Nevertheless,
upward compatibility will probably be straightforward if current GILS and
Z39.50-related standards and good programming practices are followed today.
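The idea of a nested set of profiles discussed in this section can be pictured as record types in which each more detailed tier contains everything in the tier above it. The following is a minimal sketch under that assumption only. The class and field names are illustrative and are not the official GILS or FGDC element names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LocatorRecord:
    """Most abstracted tier: a catalog entry saying that a resource exists."""
    title: str
    originator: str
    subject_terms: List[str]
    access: str


@dataclass
class DetailedMetadata(LocatorRecord):
    """More detailed tier: everything in the locator plus documentation."""
    methods: str = ""
    spatial_extent: Optional[tuple] = None  # (west, south, east, north)
    data_fields: List[str] = field(default_factory=list)


def as_locator(record):
    """Reduce a record of any tier to the locator tier for a shared catalog."""
    return LocatorRecord(record.title, record.originator,
                         record.subject_terms, record.access)


# Invented example: a fully documented data set reduced to a catalog entry.
detailed = DetailedMetadata(
    title="Bird specimen localities",
    originator="Example Museum",
    subject_terms=["birds", "specimens"],
    access="online query",
    methods="Georeferenced from specimen labels",
    spatial_extent=(-120.0, 10.0, -60.0, 60.0),
    data_fields=["binomial", "latitude", "longitude", "date", "collector"],
)
catalog_entry = as_locator(detailed)
```

Because every detailed record can be reduced to a locator record, a top-level catalog can index all tiers uniformly while the detailed documentation stays with the creator of the data.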
XI. METADATA AND ELECTRONIC PUBLISHING
IABIN's role in developing metadata for the pilot projects is an opportunity not only to set and implement metadata standards for IABIN and partner databases, but also to serve as a forum for promoting quality electronic publishing. Evaluating (in many cases, peer reviewing) public-interest data and metadata, then cataloging and publishing the data (or pointing to other data publication sites) provides a public service akin to journal publication of scientific studies. Identifying a study or data set as valuable to or part of IABIN provides evidence of professional achievement to the contributors, and might be an important incentive for independent scientists or land managers to participate in an organized metadata system. Metadata of increasing complexity might be subjected to increasing levels of review and quality control, just as journal entries range from unreviewed letters to the editor through heavily edited feature articles. Metadata guidelines in this kind of strategy then might be profitably viewed as "instructions to contributors" rather than mandates.
Electronic publishing of data from IABIN participants and partners has many advantages. Many journals are unable or unwilling to publish extensive data because of the cost of printing copies on paper. Consequently, many of the large, synthetic studies most important to IABIN's mission are under-represented in the conventional scientific literature. Providing a prestigious outlet for biodiversity assessments will undoubtedly improve communication. Electronic publishing is also inexpensive, permitting the development of a healthy scientific press at national nodes in countries where publishing resources are limited.
An exact formulation of the editorial function is a charge to the organizers
of the pilot projects. However, the pilots probably ought to include both
unreviewed web sites, permitting remote users to voluntarily catalog information
resources of potential interest to IABIN partners (akin to letters to the
editor), and peer-reviewed sites where metadata descriptions and full data
access (as appropriate) to IABIN-sponsored studies are published. Peer-reviewed
entries should have or point to full descriptions of methods,
data structure, and models used, and would normally be expected to make
the underlying data public. Unreviewed records might simply list title,
author, organization, subject, location, and access instructions, and could
cover incomplete or proprietary work as well as major scientific or management
initiatives.
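The two editorial tiers described above amount to two lists of required fields. As a rough sketch, a submission form could check a record against the list for its tier before accepting it. The field names below are taken from the lists in the text, but their exact spelling, the function name, and the sample submission are assumptions made for illustration.

```python
# Fields the text lists for unreviewed records.
UNREVIEWED_FIELDS = {"title", "author", "organization", "subject",
                     "location", "access"}
# Peer-reviewed entries add documentation of methods, structure and models.
REVIEWED_FIELDS = UNREVIEWED_FIELDS | {"methods", "data_structure", "models"}


def missing_fields(record, reviewed=False):
    """Return the sorted list of required fields that are absent or empty."""
    required = REVIEWED_FIELDS if reviewed else UNREVIEWED_FIELDS
    return sorted(name for name in required if not record.get(name))


# Invented submission that satisfies the unreviewed tier only.
submission = {
    "title": "Survey of invasive aquatic plants",
    "author": "J. Perez",
    "organization": "Example University",
    "subject": "invasive species",
    "location": "Example River basin",
    "access": "contact the author",
}

gaps_unreviewed = missing_fields(submission)
gaps_reviewed = missing_fields(submission, reviewed=True)
```

Reporting the missing fields, rather than rejecting the record outright, fits the "instructions to contributors" approach: the contributor sees what a peer-reviewed entry would still need.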
XII. TECHNICAL CAPABILITIES NEEDED FOR IABIN SITES
The general consensus of past experts' meetings has been that IABIN needs to support participants at a number of levels of commitment and complexity.
Obviously, some compromises will be necessary, and many users without new computers or dial-in capability may have to be supported by e-mail, FTP sites, or packaged CD-ROMs.
Web Site Technology
Increasingly, the function of Web sites is defined by open standards
for communications, database content, and data transfer, and
is largely independent of the underlying platform. Existing sites of IABIN
partners that operate in ways consistent with the strategy described in
this document employ Windows NT, Linux, UNIX (multiple variants), and even
occasional Macintosh platforms, using a wide variety of website software.
Most modern relational database systems have the capability to serve the
kinds of databases described above over the Web, and at least Access (with
additional tools), MS SQL Server, Oracle, and Informix are widely used
for this purpose. Most on-line GIS sites either use ESRI products (e.g.,
MapObjects and Internet Map Server, often coupled with relational databases
such as Oracle) or custom-designed graphic-display programs. Several public-domain
Z39.50 server packages are widely used, but there are also better-supported
commercial choices.
In short, IABIN sites can probably meet the capabilities described in this document in many ways, and should choose on the basis of the expertise of their staff and their other computing needs.
Connectivity
In most locations, connection to the Internet is still provided through
the local or national telecom, though deregulation and privatization are
rapidly increasing options in many urban areas. The best choices vary with
country. In most urban settings, T1 or partial T1 lines (several hundred Kb
to about 1.5 Mb per second) are available, but may be prohibitively
expensive. (Prices in the U.S. are typically one to several thousand dollars
per month.) In many places (but not the U.S.) ISDN (56 to 128 Kb/sec) is a
reasonable, slower but lower-cost alternative. However, a number of new
higher-bandwidth (> 1 Mb/sec) technologies, including Digital Subscriber Lines
and cable-modem connections, are becoming available, and should provide
adequate speed for all but the largest of sites at a much diminished cost.
Within a year, wireless high-bandwidth connections at costs of under
$100/month, often using satellites, should be technically feasible throughout
the hemisphere, though regulatory approval may slow their availability in
many places.
In short, cost and regulations, rather than availability, are likely to be the main limits to connectivity in the near future.
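To give a sense of what these line speeds mean in practice, the arithmetic below estimates idealized transfer times, taking line speeds in kilobits per second and file sizes in megabytes. The 100 MB data set is a hypothetical example, and the calculation ignores protocol overhead and line sharing, so real transfers would be slower.

```python
def transfer_hours(size_megabytes, link_kilobits_per_sec):
    """Idealized transfer time in hours, ignoring protocol overhead."""
    # 1 megabyte = 8 megabits = 8,000 kilobits (decimal units).
    size_kilobits = size_megabytes * 8 * 1000
    return size_kilobits / link_kilobits_per_sec / 3600


# Hypothetical 100 MB data set over a slow line and over a full T1.
for name, kbps in [("56 Kb/sec line", 56), ("full T1 (1,544 Kb/sec)", 1544)]:
    print(f"{name}: {transfer_hours(100, kbps):.2f} hours for 100 MB")
```

At these rates the slow line needs roughly four hours for a transfer that a full T1 completes in under ten minutes, which is why packaged CD-ROMs remain a sensible option for poorly connected users.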
Security
Since IABIN proposes to enable voluntary publication of biodiversity
data, theft of data is not likely to be a major problem. However, some
data, such as locations of endangered species, may be sensitive, and IABIN
sites should certainly draw on the extensive experience of other sensitive-species
sites (WCMC, CITES, most national biodiversity sites) in deciding which
data to blur spatially or simply not post.
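One simple way to blur locations spatially is to replace each point with the center of the grid cell that contains it. The sketch below illustrates that general idea only; it is not the procedure used by any of the sites named above. The half-degree cell size, the function names, and the species name are invented for the example.

```python
import math


def blur_coordinate(lat, lon, cell_degrees=0.5):
    """Snap a point to the center of its grid cell to hide the exact site."""
    def snap(value):
        return (math.floor(value / cell_degrees) + 0.5) * cell_degrees
    return snap(lat), snap(lon)


def publishable(record, sensitive_species, cell_degrees=0.5):
    """Return a copy of the record, blurred if the species is sensitive."""
    out = dict(record)
    if record["binomial"] in sensitive_species:
        out["latitude"], out["longitude"] = blur_coordinate(
            record["latitude"], record["longitude"], cell_degrees)
        out["generalized"] = True
    return out


# Invented record for an invented sensitive species.
SENSITIVE = {"Exemplum rarum"}
original = {"binomial": "Exemplum rarum", "latitude": 10.12, "longitude": -84.03}
posted = publishable(original, SENSITIVE)
```

The archival record keeps the exact coordinates; only the copy posted to the public site is generalized, and the added flag tells users that the location is approximate.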
Protection against accidental or deliberate damage to the sites is more difficult. Existing biodiversity sites seem especially vulnerable to deliberate virus and "spam" attacks, and it is likely that any large site should use one of the many excellent commercial security suites (firewall, anti-virus, etc.) to help protect the site. Major vendors include Network Associates, Check Point Software, and Symantec, but many smaller companies offer effective products. Limiting public access to pilot data, at least at the initial stage, by password access is straightforward, but may or may not be desirable if improved communication is a principal objective. One approach that has been effective in many countries is to establish "research web sites" at a university or NGO during the development phase, then to open the official and accountable government version after the research site is fully operational and debugged.
Operators of public databases are always well advised to keep the official, archival copy of any data on a machine not accessible to the public, then to periodically post copies on the public site. Similarly, IABIN may choose to create "mirror" sites in multiple countries to provide more reliable access. Without automated mechanisms to update "mirror" sites, this is more expensive than it sounds, since "version-control" problems inevitably arise from uncoordinated updates at multiple sites. The San Diego Supercomputer Center is currently researching methods for automated mirroring of large environmental datasets, and may provide appropriate (and free) technology for doing so during the pilot studies.
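A basic safeguard against the version-control problem is to compare a fingerprint of each mirror's copy with a fingerprint of the archival master. The sketch below shows one way this could be done with a standard hash function. It is an illustration only and is not the San Diego Supercomputer Center's method; the function names and sample data are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest identifying one exact version of a dataset."""
    return hashlib.sha256(data).hexdigest()


def stale_mirrors(archive_bytes, mirror_copies):
    """Return names of mirrors whose copy differs from the archival master."""
    master = fingerprint(archive_bytes)
    return sorted(name for name, data in mirror_copies.items()
                  if fingerprint(data) != master)


# Invented data: mirror_b received an uncoordinated local edit.
archive = b"binomial,lat,lon\nTurdus migratorius,38.97,-95.24\n"
mirrors = {
    "mirror_a": archive,
    "mirror_b": archive + b"Passer domesticus,19.40,-99.10\n",
}
out_of_date = stale_mirrors(archive, mirrors)
```

In practice each mirror would publish only its fingerprint, so the comparison can be made without transferring the full dataset.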
Search and Query Technology
A number of engines for querying complex environmental data sources
and documents are currently in use in the established biodiversity database
centers. These include both general purpose search engines (Harvest, AltaVista,
etc.) and a variety of project-specific query tools. In practice, general
keyword searches can always be done, but the principal limitations on structured
cross-site or cross-database queries are less technological than terminological:
complex queries only work effectively when vocabularies have been standardized.
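The dependence of cross-site queries on standardized vocabularies can be shown with a small sketch in which every subject term is mapped to a preferred term before matching. The thesaurus entries, catalog contents, and function names are invented for illustration; a real thesaurus would be agreed by the participating communities.

```python
# Hypothetical thesaurus: each variant term maps to one preferred term.
THESAURUS = {
    "invasive species": "invasive species",
    "alien species": "invasive species",
    "exotic species": "invasive species",
    "especies invasoras": "invasive species",
    "amphibian decline": "amphibian decline",
    "declining amphibians": "amphibian decline",
}


def normalize(term):
    """Map a free-text subject term to its preferred form, or None if unknown."""
    return THESAURUS.get(term.strip().lower())


def search(catalogs, query_term):
    """Return (site, title) pairs whose subjects match the query term."""
    wanted = normalize(query_term)
    if wanted is None:
        return []
    hits = []
    for site, records in catalogs.items():
        for rec in records:
            if wanted in {normalize(s) for s in rec["subjects"]}:
                hits.append((site, rec["title"]))
    return hits


# Invented catalogs from two nodes that describe the same topic differently.
catalogs = {
    "node_a": [{"title": "Invasive fish survey",
                "subjects": ["Alien species", "fish"]}],
    "node_b": [{"title": "Frog monitoring",
                "subjects": ["declining amphibians"]},
               {"title": "Weed atlas",
                "subjects": ["especies invasoras"]}],
}
results = search(catalogs, "exotic species")
```

A plain keyword search for "exotic species" would find neither matching record here; the structured query succeeds only because both nodes' terms resolve to the same preferred term.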
Extensible Markup Language (XML) or related "document-centric" technologies
may provide a means for more generalized access to environmental information.
XML may loosely be looked at as a generalization of HTML, with more structured
"metatags" categorizing each information "object" (page, document, map, herbarium
sheet, etc.). If vocabularies are standardized in the (GILS-style) metadatabases
discussed above, those metadatabases can be used to automatically generate
easily searchable "metatags" for XML. XML, in turn, can be viewed ("parsed")
with standard Web browser technology, or may be used to generate conventional
HTML directly. One school of thought is that complex databases will continue
to be developed in relatively conventional relational (or object-relational)
database formats (Oracle, Informix, etc.), which will in turn generate
XML "middleware." The XML can then be browsed or queried more easily by
non-experts than the full-blown (SQL or equivalent) database. In any case,
XML-like technology is being rapidly developed by most of the world's largest
software companies and is central to their strategies for "data warehousing."
Consequently, the capabilities of future XML are unpredictable, but it
is sure to be more powerful and flexible than it appears at present. At
this point, IABIN pilot projects should at least consider representing
their data in an XML format, but it is probably premature to start designing
specific search or query strategies using XML technology.
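As a small illustration of generating XML tags automatically from a metadata record, the sketch below turns a dictionary of catalog fields into an XML document and reads it back. The element names and the sample record are hypothetical and do not follow any published schema. The code uses only the XML module in the Python standard library.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record):
    """Build an XML element whose child tags come from the record's field names."""
    root = ET.Element("resource")
    for name, value in record.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, name).text = str(item)
    return root


# Invented metadata record with a repeated subject field.
record = {
    "title": "Invasive fish survey",
    "originator": "Example Institute",
    "subject": ["invasive species", "fish"],
}

xml_text = ET.tostring(record_to_xml(record), encoding="unicode")

# Any XML-aware tool can now find fields by tag name.
parsed = ET.fromstring(xml_text)
subjects = [el.text for el in parsed.findall("subject")]
```

Because the tags are produced from the catalog's own field names, the XML is only as searchable across sites as those field names and subject terms are consistent.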
XIII. CONCLUSIONS
The technical approaches to IABIN pilot projects should closely reflect those taken by IABIN as a whole, as well as other partners in the Clearing-House Mechanism, BCIS, MABNet, and other international biodiversity networking initiatives. As in any technical undertaking, the first step is to inventory available information, technology, and information needs. It is highly desirable to do so in a formal and organized fashion, developing consistent catalogs of experts, partner organizations, species and resources of concern to those experts and organizations, existing data sets, major existing scientific and management projects, and various kinds of resulting synthetic data, educational materials, and capacity-building opportunities. This would be an almost insurmountable task if attempted for all important biodiversity resources in the hemisphere. Consequently, these catalogs should first be developed for a small number of well-defined environmental issues of particularly high visibility and importance. These are the IABIN pilot projects. If carefully done, the information systems developed for those projects should be widely applicable to other biodiversity issues considered by IABIN.
As a practical matter, a generally agreed format for digital catalogs (or "metadatabases") in the environmental area has been adopted by many of IABIN's partner organizations. This is the G8-standard GILS specification. GILS fits data types defined in a number of pilot proposals (e.g., invasive species) well.
The power of a metadata system, however, is derived less from the format of the catalogs than it is from consistent use of language in describing the information available. "Controlled vocabularies" are recognized as a major research issue in metadata development by many IABIN participants and partners, and a number of draft proposals for shared environmental "thesauri" should be examined as part of each pilot project. However, as a practical matter, particular thesauri have been adopted as useful only by fairly well defined professional "communities", and it is probable that IABIN will have to track different kinds of data through different vocabularies useful to different sets of participants. It is the social process of developing a consensus on language, rather than the technical representation of that language, that is the limiting process on the coordination of environmental data across multiple countries, organizations, and professions. Presumably, the pilot projects will provide methods and insight on how to harmonize language on fairly well-defined topics.
Once the vocabulary strategy is under control, a hierarchical set of network nodes, ranging from hemispheric sites, through national representatives, specialist partners, and remote users, can be constructed using public domain information standards and off-the-shelf hardware and software. Again, the technical challenges are less imposing than the human challenges of how to provide incentives for busy professionals to participate, how to provide the training and funding they need to develop and manage nodes, and how to coordinate their activities. However, maturing World-Wide Web technology and structured-document (XML-like) expression of information are well suited to the complexity and informal nature of environmental networking, and it is likely that IABIN can easily model the technical solutions on the basis of the human innovations it is fostering.
APPENDIX
List of Acronyms
BCIS Biodiversity Conservation Information System
BDT Base de Dados Tropical
CEC Commission for Environmental Cooperation
CHM Clearing-House Mechanism, of the Convention on Biological Diversity
FGDC Federal Geographic Data Committee
GILS Global Information Locator Service
IABIN Inter-American Biodiversity Information Network
MABNet Man and the Biosphere Network
MARC Machine-Readable Cataloging
NABIN North American Biodiversity Information Network
NBII National Biological Information Infrastructure
NGO Non-Governmental Organization
XML Extensible Markup Language