The availability of electronic resources, particularly those on the Internet and the World Wide Web, presents a great challenge to library and information professionals. How to describe them and how to provide subject access to them have been topics of intense scrutiny and discussion. Before we approach these questions, let us consider briefly the nature of Web resources. Their high volume and phenomenal growth rate are their most distinctive characteristics. Web resources are also amorphous, volatile, ill-defined, not self-contained, and unstable.
The existing bibliographic control apparatus provides a framework for resource description and knowledge organization, while subject access tools provide users with a way to navigate the system and find what they need. This paper will focus on subject vocabulary used for retrieval of Web resources. I would like to start with what are generally perceived to be the functions of subject access tools. Effective subject access tools should be designed:
- to assist searchers in identifying the most efficient paths for subject retrieval
- to help users focus their searches
- to enable optimal recall
- to enable optimal precision
- to assist searchers in developing alternative search strategies
- to provide all of the above in the most efficient, effective, and economical manner
To implement mechanisms and devices that would fulfill the requirements stated above, certain operational requirements should be borne in mind. The subject data in the metadata record should be adaptable and flexible in order to accommodate simple descriptive schemes such as the Dublin Core as well as elaborate ones such as library cataloging rules. Schemes for supplying subject data in the metadata record should:
- be simple and logical, so that they require the least effort on the part of the user to understand and to use;
- be easy to apply and to maintain;
- be intuitive, so that sophisticated training in subject indexing and classification, while highly desirable, is not required for implementation;
- be flexible and adaptable, meaning that the methods, tools, and procedures are scalable and extensible in order to be viable in environments ranging from the simplest application to the most complex and detailed implementations;
- be appropriate to the specific discipline and subject, and to the domain of implementation such as libraries, museums, archives, information services, the scientific community, and personal knowledge management; and,
- be interoperable across disciplines and in various knowledge discovery and access environments, not the least among which is the online public access catalog (OPAC).
In particular, semantic interoperability mandates that the vocabularies used in different communities should lend themselves to harmonization, in order to enable users to search across a wide range of metadata schemes. International consensus would require that, ideally, the methods and tools chosen should be capable of cross linking and mapping among different vocabularies and between different languages.
To address these issues, the Association for Library Collections and Technical Services (ALCTS) of the American Library Association established a Subcommittee on Metadata and Subject Analysis in 1997 to study subject vocabularies and organization schemes for electronic resources. The charge to the Subcommittee was:
"Identify and study the major issues surrounding the use of metadata in the subject analysis and classification of digital resources. Provide discussion forums and programs relevant to these issues."
Initial deliberations focused on the Dublin Core Metadata Scheme, but later discussions were broadened to include metadata schemes in general.
The Subcommittee has completed its deliberations and has presented its recommendations in a report (ALCTS/CCS/SAC/Subcommittee 1999). The Subcommittee considered both verbal representation (i.e., the desirability of controlled vocabulary) and classification data. This document and the papers from the recent Library of Congress Bicentennial Conference on Bibliographic Control for the New Millennium (Bicentennial 2000) will be used as the basis of my presentation.
As a background and framework of my presentation, I would like to borrow some of the common themes that emerged from the conference papers. Those that would have an impact on subject access tools include the following:
- We should consider a tiered approach to bibliographic control of networked resources selected for access through the library, in concert with selection and collection development policies.
- There is a general recognition that multiple standards and multiple schemes are being used for different purposes, sometimes within the same retrieval system.
- We need to be concerned about efficiency (while retaining quality) and capacity for handling large quantities of resources.
- Subject access tools should be scalable and extensible in order to accommodate the need for different degrees of depth and different subject domains.
- Interoperability is an important consideration.
- For users, we need one-stop, seamless access.
In addition, we also need to consider some of the recent developments in the field of subject access. These include automatic indexing, mapping terms and data from different sources, and integrating different subject access tools.
Now, let us look at some of the specific issues explored by the ALA Subcommittee. The first question regarding verbal representation is: To control or not to control the vocabulary? There are basically three options:
- Keywords (free-text) only
- Keywords and controlled vocabulary
- Controlled vocabulary only
Following are the ALA Subcommittee's recommendations:
- A combination of keywords and controlled vocabulary should be used to allow users the choice of simple free-text indexing as well as complex controlled vocabulary indexing. (3.1)
- In order to achieve the desired level of specificity, controlled vocabulary terms assigned to the metadata record could be supplemented and complemented by keywords and other subject-related elements, such as title, abstract, statement of content, etc.
- In the Dublin Core metadata record, the Subcommittee recommends the inclusion in the SUBJECT element of both free-text and controlled terms, where appropriate and feasible, in order to achieve optimal recall and precision in retrieval. (5.1)
With these recommendations in mind, let us consider the reasons for using a controlled vocabulary, which are:
- To provide for consistent representation of subject matter, thereby avoiding subject dispersion, at input (indexing) and output (searching) by control of synonyms, near synonyms, and quasi synonyms and by differentiation of homographs.
- To facilitate broad (generic) searches by bringing together in some way terms that are semantically related, including both logical or inherent relationships (invariable, e.g., aluminum and metal) and perceived relationships (e.g., aluminum and beer barrels or soft drink containers).
In considering the use of controlled vocabulary, it is important to bear in mind both its advantages and its disadvantages. The advantages of controlled vocabulary include:
- Control for synonyms
- Control for homographs
- Displaying term relationships
- Giving high precision in retrieval
- Easy to do generic searches
On the other hand, the disadvantages of using controlled vocabulary should also be recognized:
- Lack of flexibility
- Requiring highly trained personnel
- Subject to human error in indexing
- Slow in incorporating new concepts
- Not always providing the desired level of specificity
The best approach is probably a combination of free-text and controlled vocabulary, because they complement each other.
With the availability of automatic indexing methods, free-text keyword access has become predominant on the Web. What role can controlled vocabulary play in this environment? To answer this question, I would like to focus first on several aspects of controlled vocabulary. First, we examine the expanding roles of controlled vocabulary. The value of controlled vocabulary in surrogate-based indexing and retrieval systems (particularly in the OPAC and other information and retrieval systems) is well recognized. Even in text-based or content-based indexing and retrieval, keywords can be supplemented with terms "borrowed" from a controlled vocabulary to improve retrieval performance, for example, as a query-expansion device to enhance the "more like this" feature using controlled terms as well as terms extracted from texts, based on various ranking algorithms. Results of keyword searching on the Web are often less than satisfactory. Controlled vocabulary can be used to complement uncontrolled terms and terms from lexicons, dictionaries, gazetteers, and similar tools, which are rich in synonyms, but often lacking in relational terms.
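The query-expansion role described above can be made concrete with a minimal sketch. It assumes a tiny, hypothetical thesaurus (the names `ENTRY_TERMS`, `RELATIONS`, and `expand_query` and all the sample terms are invented for illustration and are not taken from LCSH or any real system). A searcher's keyword is mapped to its preferred term, and the query is then widened with the other entry terms and, optionally, the narrower terms that the controlled vocabulary supplies.

```python
# Hypothetical mini-thesaurus, for illustration only.
# Entry (non-preferred) terms lead to a single preferred term.
ENTRY_TERMS = {
    "cars": "automobiles",
    "autos": "automobiles",
    "motor cars": "automobiles",
}

# Relationships recorded for each preferred term.
RELATIONS = {
    "automobiles": {
        "narrower": ["sports cars", "electric automobiles"],
        "related": ["automobile industry"],
    },
}


def expand_query(keyword, include_narrower=True):
    """Return the list of terms to search for a single user keyword."""
    term = keyword.strip().lower()
    preferred = ENTRY_TERMS.get(term, term)

    expanded = [term]
    if preferred not in expanded:
        expanded.append(preferred)

    # Synonym control: add every entry term that leads to the same preferred term.
    for entry, target in ENTRY_TERMS.items():
        if target == preferred and entry not in expanded:
            expanded.append(entry)

    # Generic searching: optionally add narrower terms of the preferred term.
    if include_narrower:
        for narrower in RELATIONS.get(preferred, {}).get("narrower", []):
            if narrower not in expanded:
                expanded.append(narrower)

    return expanded


if __name__ == "__main__":
    print(expand_query("Cars"))
```

A keyword with no match in the thesaurus is simply passed through unchanged, which reflects the complementary relationship between free-text and controlled terms: the controlled vocabulary adds to the keyword search rather than replacing it.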
Having discussed the role of controlled vocabulary, we turn to the next question: Which controlled vocabulary should be chosen in a particular circumstance or environment? Based on the ALA Subcommittee's deliberations, there are a number of questions relating to the choice of a controlled vocabulary or multiple vocabularies:
- Use of existing schemes
  Are existing schemes suitable for use on the Web? If not,
- Adaptation/modification of existing systems
  What needs to be modified, and how?
  Who will be responsible?
- Development of new scheme(s)
  Who will develop and maintain it/them?
- A single controlled vocabulary: "one size fits all"?
- General vs. specialized schemes
- Multiple vocabularies
  How do we harmonize terms from different vocabularies?
- Exploring the possibility of developing a general metathesaurus (i.e., a thesaurus of thesauri; cf. the medical metathesaurus (National Library of Medicine 2000))
  Is a metathesaurus covering all subjects feasible?
To answer these questions, the Subcommittee recommends the following (ALCTS/CCS/SAC/Subcommittee 1999):
- For the sake of semantic interoperability, the Subcommittee recommends adopting an existing vocabulary or vocabularies with or without modification. (5.1.1)
- The adoption or adaptation of Library of Congress Subject Headings (LCSH) or Sears List of Subject Headings (for subject representation on a broader level) as the basis for subject data in the Dublin Core metadata records for a general collection is recommended.
- Use of multiple vocabularies should be accommodated.
- For a general vocabulary covering all subjects, the Subcommittee recommends the use of LCSH or Sears with or without modification. (3.2.1)
- Criteria for choosing specialized vocabularies should be based on subject matter, the intended audience, term specificity, and syntax.
- The development and refinement of methods for harmonization of subject terms from different controlled vocabularies should be undertaken; and,
- The investigation of the feasibility of developing a general metathesaurus or expanding the medical metathesaurus to include indexing terms covering all subject areas should be encouraged. (3.2.4)
Related to the second recommendation mentioned above, the question may be raised as to why, in the networked environment, we want to continue using LCSH, which was originally designed for the card catalog. The reasons are many:
- LCSH is a rich vocabulary covering all subject areas, easily the largest general indexing vocabulary in the English language;
- there is synonym and homograph control;
- it contains rich links (cross references indicating relationships) among terms;
- having been translated or adapted as a model for developing subject headings systems by many countries around the world, LCSH is a de facto universal controlled vocabulary; and,
- retaining LCSH as subject data in metadata records would ensure semantic interoperability between the enormous store of MARC records in the library catalogs and metadata records prepared according to various standards.
Projecting ahead, we see the potential of LCSH also serving as the basis for creating multilingual thesauri, as the basis for building subject or domain-specific thesauri, and as a framework for implementing, perhaps some time in the future, even a comprehensive metathesaurus.
However, to enhance the performance in their expanding roles, certain issues relating to existing controlled vocabularies, including Library of Congress Subject Headings, need to be re-examined. These issues include:
- Structural issues
- Term relationships
- Application issues:
- Existing vs. modified or new schemes
- Guidelines for implementation
- System design issues
Structural aspects relating to semantics concern term specificity (i.e., what level of specificity is most desirable and suitable?), synonym and homograph control, and term relationships. The ALA Subcommittee recommends (ALCTS/CCS/SAC/Subcommittee 1999):
- Each implementing agency should establish policies regarding the appropriate level of subject representation for its collection. At the appropriate level, the most specific subject terms provided by the chosen controlled vocabulary should be assigned. (5.1.2)
- Synonyms should be handled by system design implementation of the controlled vocabulary or thesaurus. If this is not available, an alternative is to include all identified synonyms and related terms, along with the keywords, in the metadata record.
Syntax issues (how words are put together to form index terms) concern the use of full strings or single-concept descriptors. It is recommended by the ALA Subcommittee that:
- The metadata record, and the subject element in particular, should be as simple or as complex as desired. Trained catalogers may choose to continue to apply LCSH to the metadata records in the same manner as those assigned to MARC records. For those not trained in subject cataloging, the Subcommittee recommends a simplified syntax. (3.2.3)
- With regard to syntax, the use of full LCSH subject strings, if feasible (i.e., if time and trained personnel are available), particularly in the OPAC environment, should be encouraged. For the Dublin Core, the Subcommittee endorses the use of other elements (type, coverage) in addition to the SUBJECT element to accommodate different facets related to subject: topic, place, period, language, etc. Deconstructed subject strings should be so designated. (5.1.4)
Regarding the application of controlled vocabulary, the following issues were explored by the Subcommittee: consistency, summarization vs. exhaustive indexing, placement of non-topical data, and pre-coordination vs. post-coordination, with the recommendation that:
- Within a specific digital collection or project the application of subject analysis should be consistent; in other words, the same semantics and syntax should be applied throughout. Compatibility with other metadata schemes is also desirable. When a controlled vocabulary is used, the version of the vocabulary should be indicated along with the date on which the subject data are created. (5.1.3)
Focusing on the Library of Congress Subject Headings system, several possible approaches may be considered:
- Separating syntax from semantics
  If we separated semantics (i.e., thesaural terminology) from heading formation (or application syntax), we could apply existing subject headings schemes such as LCSH with a simplified syntax by breaking out the non-topical elements (geographic, chronological, and form subdivisions).
- Designing a series of flexible syntaxes, from the most complex (e.g., full-string approach) to the simplest (descriptor-like terms), allowing different degrees of sophistication
- Decision regarding pre-coordination vs. post-coordination
At the heart of the syntax issue is the representation of complex subjects through combination, or coordination, of terms representing different subjects or different facets (defined as families of concepts that share a common characteristic (Batty 1998)) of a subject. There are two aspects of syntax: term construction and application syntax. Term construction, i.e., how words are put together to represent concepts in the thesaurus, is a matter of principle, while application syntax, i.e., how thesaural terms are put together to reflect the contents of documents in the index, is a matter of policy. Enumeration and faceting are aspects of term construction, while pre-coordination and post-coordination relate to application syntax.
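Breaking the non-topical elements out of a full subject string, and placing them in other Dublin Core elements, can be sketched as follows. The sketch assumes that each part of a heading is tagged with its MARC 21 subfield code as used in topical subject fields ($a topical heading, $x general subdivision, $y chronological subdivision, $z geographic subdivision, $v form subdivision). The mapping of facets to Dublin Core elements is an assumption made here for illustration, following the Subcommittee's mention of the type and coverage elements. The sample heading is illustrative and has not been checked against LCSH, and the function names are hypothetical.

```python
from collections import defaultdict

# MARC 21 subfield codes used in topical subject fields, mapped to facets.
FACET_OF_SUBFIELD = {
    "a": "topic",   # topical main heading
    "x": "topic",   # general (topical) subdivision
    "z": "place",   # geographic subdivision
    "y": "period",  # chronological subdivision
    "v": "form",    # form subdivision
}

# Assumed, illustrative mapping of facets to Dublin Core elements.
DC_ELEMENT_OF_FACET = {
    "topic": "subject",
    "place": "coverage",
    "period": "coverage",
    "form": "type",
}


def full_string(heading):
    """Return the heading as one full (pre-coordinated) subject string."""
    return "--".join(value for _code, value in heading)


def deconstruct(heading):
    """Distribute the parts of a heading over Dublin Core elements by facet."""
    record = defaultdict(list)
    for code, value in heading:
        facet = FACET_OF_SUBFIELD[code]
        record[DC_ELEMENT_OF_FACET[facet]].append(value)
    return dict(record)


# Illustrative heading: a list of (subfield code, value) pairs.
heading = [
    ("a", "Education"),
    ("z", "United States"),
    ("x", "History"),
    ("y", "20th century"),
    ("v", "Bibliography"),
]

if __name__ == "__main__":
    print(full_string(heading))
    print(deconstruct(heading))
```

The same source data thus supports both ends of the range of syntaxes: the full string for environments that can display and browse it, and the deconstructed, descriptor-like terms for simpler metadata records.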
Term combination can occur at any of three stages in the process of information storage and retrieval: during thesaurus construction, at the stage of cataloging or indexing, or at the point of retrieval. At this point, let us review the definitions of some of the terms used in this regard:
Enumeration. When words or phrases representing different subjects or different facets of a subject are pre-combined in the vocabulary list (i.e., at the point of thesaurus construction), we refer to the process as enumeration.
Synthesis. When words or phrases representing different subjects or concepts are listed individually in the vocabulary list to be combined later, we refer to the practice as synthesis.
Faceting. When only the basic core vocabulary, defined in distinctive categories, is given in the list, we refer to the practice as faceting.
Pre-coordination. When term combination occurs at the stage of indexing or cataloging, we refer to the process as pre-coordination.
Post-coordination. In contrast, post-coordination refers to the combination of terms by the searcher at the point of retrieval.
Because of various search environments and different needs of diverse user communities, a vocabulary that is flexible enough to be used either pre-coordinately or post-coordinately, i.e., able to accommodate different application syntaxes, from the most complex (e.g., full-string approach in OPACs) to the simplest (descriptor-like terms used in most indexes) and allowing different degrees of sophistication, would be the most useful and viable. A totally enumerative vocabulary is by definition pre-coordinated. On the other hand, a faceted controlled vocabulary--i.e., a system that provides individual terms in clearly defined categories, or facets-- may be applied either pre-coordinately or post-coordinately. A faceted scheme hence is more flexible. Whether a pre-coordinate approach or a post-coordinate approach is used in a particular implementation is a matter of policy and is agency-specific.
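To illustrate why a faceted vocabulary can serve either application syntax, here is a minimal sketch under stated assumptions. The vocabulary, the facet names, and the citation order below are hypothetical and chosen only for the example; they are not drawn from LCSH or FAST. The same set of assigned terms is validated against the facets and then either combined into one string at indexing time or stored as separate terms for the searcher to combine.

```python
# Hypothetical faceted vocabulary: individual terms listed under defined facets.
FACETED_VOCABULARY = {
    "topic": {"Architecture", "Bridges"},
    "place": {"Italy", "Japan"},
    "period": {"16th century", "20th century"},
    "form": {"Dictionaries", "Pictorial works"},
}

# Assumed order in which facets are combined when a string is built.
CITATION_ORDER = ["topic", "place", "period", "form"]


def validate(assigned):
    """Raise ValueError if any assigned term is not authorized in its facet."""
    for facet, term in assigned.items():
        if term not in FACETED_VOCABULARY.get(facet, set()):
            raise ValueError(f"{term!r} is not an authorized {facet} term")


def precoordinate(assigned):
    """Combine the terms at the indexing stage into one subject string."""
    validate(assigned)
    return "--".join(assigned[f] for f in CITATION_ORDER if f in assigned)


def postcoordinate(assigned):
    """Keep the terms separate, leaving combination to the searcher."""
    validate(assigned)
    return [(f, assigned[f]) for f in CITATION_ORDER if f in assigned]


assigned = {"topic": "Architecture", "place": "Italy", "period": "16th century"}

if __name__ == "__main__":
    print(precoordinate(assigned))
    print(postcoordinate(assigned))
```

Which of the two functions an agency calls is the policy decision; the vocabulary itself does not change.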
The advantages of a faceted controlled vocabulary can be summarized as follows:
- simple in structure
- amenable to software applications (Batty 1998)
- flexible in application
- interoperable with most other modern indexing vocabularies
- easy to maintain
A re-examination of a controlled vocabulary such as LCSH should begin with semantic issues that have to do with the vocabulary itself, apart from application policies. Aspects of particular concern that need close scrutiny and re-thinking include:
- principles of term selection, i.e., issues relating to the scope and depth, or how specific the terms need to be;
- enhanced entry vocabulary, an important property of a controlled vocabulary (prominent scholars, especially Elaine Svenonius (Svenonius 2000), have written extensively on this issue);
- rigorous term relationships, combined with or supported by a classificatory structure; and,
- term construction, particularly the representation of complex subjects.
It is a question of enumeration vs. faceting or synthesis.
Currently, LCSH is a partially faceted and partially enumerated scheme. The list contains many pre-established multiple-concept headings and heading/subdivision combinations. On the other hand, the distinctive categories of geographic, chronological, and form and topical free-floating subdivisions reflect the characteristics of a faceted scheme. The question is: should or could LCSH become a more faceted scheme? While a faceted scheme is more flexible, one should also be aware of the disadvantages of a faceted vocabulary. A faceted LCSH would lose the following features:
- Loss of cross references between and among multi-faceted headings that are now enumerated in LCSH (Mann 2000)
- Loss of the recognition (suggestibility or serendipity) factor in online thesaurus display and browsing
On the other hand, there are a number of advantages of a faceted LCSH. They can be summarized in these terms:
- A faceted LCSH would be able to accommodate both pre-coordinate and post-coordinate indexing and retrieval.
  (It is important to stress that a faceted vocabulary does not preclude pre-coordination. Pre-coordinate indexing is prevalent in OPACs, e.g., subject heading strings in MARC records. Post-coordinate indexing, on the other hand, is more common among information storage and retrieval systems. It is also the application syntax adopted in OCLC's FAST project, proposed for use as a method of providing subject data in Dublin Core records.)
- It would enable a tiered approach to allow different levels of complexity in subject representation.
- It would facilitate mapping of subject data and cross-database searching.
- A faceted vocabulary is easier and more economical to maintain than an enumerated one.
- A faceted LCSH would be more amenable to computer-assisted indexing.
- It would be able to accommodate different retrieval models.
- It would facilitate the creation of domain- or discipline-specific thesauri.
On the last point, while many subject domains and disciplines such as engineering, art, and biomedical sciences have their own controlled vocabularies, many specialized areas and non-library institutions still lack them. These include for-profit as well as non-profit organizations, government agencies, historical societies, special-purpose museums, consulting firms, and fashion design companies, to name a few. Many of these rely on their curators or researchers, who typically have not been trained in bibliographic control, to take responsibility for organizing Internet resources. Having a comprehensive subject access vocabulary to draw and build upon would be of tremendous help in developing their specialized thesauri.
Now let us examine the application issues, particularly the practice of pre-coordination vs. post-coordination, i.e., how index terms from the controlled vocabulary are put together and assigned to bibliographic or metadata records. Again, there are advantages and disadvantages to each approach. The pros and cons of the post-coordinate approach can be summarized as follows. On the one hand, the post-coordinate approach:
- Simplifies subject heading construction
- Simplifies index display
- Simplifies authority control
- Is easier to use and to understand
- Is interoperable with other controlled vocabularies
On the other hand, it entails:
- Loss of precision in representing many complex and multi-faceted subjects
- Loss of precision in those subject heading strings where the order of and relationships between facets affect the meaning, particularly in systems where a faceted LCSH is applied post-coordinately
- Loss of precision due to improper combination of facets in retrieval
- Loss of the ability to show complex subjects in index display and browsing
Traditionally, LCSH has been applied pre-coordinately. For MARC records in library catalogs, this practice continues. Given the characteristics of Web search engines and the nature of Web resource retrieval, particularly in view of the lack of the ability to accommodate full-string browsing and searching, and a shortage of trained personnel, post-coordination may be a more viable approach under certain circumstances.
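The trade-off between the two approaches, in particular the loss of precision when the order of facets carries meaning, can be shown with a small sketch. The two records and their headings are illustrative assumptions, as are the function names. A search on the full string distinguishes the two subjects, while a post-coordinate search that simply combines the individual terms retrieves both.

```python
# Illustrative records; each carries one pre-coordinated subject string.
RECORDS = {
    "rec1": ["History--Philosophy"],   # a work on the philosophy of history
    "rec2": ["Philosophy--History"],   # a work on the history of philosophy
}


def search_precoordinated(subject_string):
    """Match the full string, so the order of the terms is significant."""
    return sorted(
        rec_id for rec_id, headings in RECORDS.items()
        if subject_string in headings
    )


def search_postcoordinated(*terms):
    """Combine individual terms with Boolean AND at the point of retrieval."""
    wanted = set(terms)
    hits = []
    for rec_id, headings in RECORDS.items():
        # Break every heading into its component terms, ignoring their order.
        indexed = {term for heading in headings for term in heading.split("--")}
        if wanted <= indexed:
            hits.append(rec_id)
    return sorted(hits)


if __name__ == "__main__":
    print(search_precoordinated("Philosophy--History"))
    print(search_postcoordinated("Philosophy", "History"))
```

The post-coordinate search is simpler to build and to use, but it returns the record on the philosophy of history along with the one on the history of philosophy; the pre-coordinated string search returns only the latter.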
The FAST Scheme
Even as the Subcommittee was completing its deliberations and finalizing its report, a scheme reflecting many aspects of the Subcommittee's recommendations was being developed. The FAST (Faceted Application of Subject Terminology) scheme (Chan et al. 2001), sponsored by OCLC, is a controlled vocabulary based on LCSH but with a more faceted and post-coordinate approach. It has been in development since 1998, and upon completion, will be implemented in Dublin Core metadata records in OCLC's CORC (Cooperative Online Resources Catalog) system.
Controlled vocabulary has served the library and information community long and well. For more than a century, it has provided a reliable means of ensuring both recall and precision in information retrieval. Even in the Web environment, where the keyword or free-text approach is predominant, controlled vocabulary can play an important role.
References
ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis. (1999). Subject Data in the Metadata Record: Recommendations and Rationale: A Report from the ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis. http://www.ala.org/alcts/organization/ccs/sac/metarept2.html
Batty, David. (1998). WWW -- Wealth, Weariness or Waste: Controlled Vocabulary and Thesauri in Support of Online Information Access. D-Lib Magazine, November 1998. http://www.dlib.org/dlib/november98/11batty.html
Bicentennial Conference on Bibliographic Control for the New Millennium: Confronting the Challenges of Networked Resources and the Web, Sponsored by the Library of Congress Cataloging Directorate, November 15-17, 2000. (2000). http://lcweb.loc.gov/catdir/bibcontrol/
Chan, Lois Mai, Eric Childress, Rebecca Dean, Edward T. O'Neill, and Diane Vizine-Goetz. (2001). A Faceted Approach to Subject Data in the Dublin Core Metadata Record. Journal of Internet Cataloging. 4(1/2):35-47.
Mann, Thomas. (2000). Is Pre-coordination Unnecessary in LCSH? Are Web Sites More Important to Catalog than Books? Bicentennial Conference on Bibliographic Control for the New Millennium: Confronting the Challenges of Networked Resources and the Web, Sponsored by the Library of Congress Cataloging Directorate, November 15-17, 2000. http://lcweb.loc.gov/catdir/bibcontrol/
National Library of Medicine. (2000). Fact Sheet: UMLS® Metathesaurus®. February 2000. http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html
Svenonius, Elaine. (2000). The Intellectual Foundation of Information Organization. Cambridge, MA: MIT Press.