Cataloging the Web

Making the WWW More Accessible

2. Cataloging the Web: Subjects
 
Eyler Coates
(4) SUBJECTS. This would require the existence of a standard list of subject headings, probably made available by the Search Engine (see below), from which Webmasters could select (with helps) up to five appropriate headings for their own page and put them in a tag:
<meta name="subjects" content="--------">
Rather than use something as complicated as the Library of Congress Subject Headings, something less detailed, such as the Sears Subject Headings, would probably be sufficient for the WWW.
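To make the proposed tag concrete, here is a minimal Python sketch of how a Search Engine might read a "subjects" tag and check it against a standard list. The heading list, the semicolon separator, and the function names are assumptions made for illustration; the proposal itself specifies only the tag name and the limit of five headings.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the standard list a Search Engine might publish.
STANDARD_HEADINGS = {"Libraries", "United States", "Astronomy", "Comets", "Biography"}
MAX_HEADINGS = 5  # the proposal allows up to five headings per page


class SubjectsParser(HTMLParser):
    """Collects headings from <meta name="subjects" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "subjects":
            content = attributes.get("content") or ""
            # Assumed convention: headings are separated by semicolons.
            self.headings.extend(h.strip() for h in content.split(";") if h.strip())


def check_subjects(html):
    """Return (accepted, rejected) headings for one page."""
    parser = SubjectsParser()
    parser.feed(html)
    candidates = parser.headings[:MAX_HEADINGS]
    accepted = [h for h in candidates if h in STANDARD_HEADINGS]
    rejected = [h for h in candidates if h not in STANDARD_HEADINGS]
    return accepted, rejected
```

A page tagged with "Comets; Astronomy; Hale-Bopp" would have its first two headings accepted and the third flagged as not on the list, which is the point at which the Webmaster would need the "helps" mentioned above.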
 
Sonja Scarseth
Mr. Coates goes on to suggest that Webmasters develop their own keyword lists and that humans add appropriate cross references. I would suggest that their efforts be spent, rather, on working out ways of turning LCSH into keywords and on taking advantage of all the work that has already been done on cross referencing. Why reinvent the wheel?
 
Robert Cunnew
I'm not familiar with Sears, but isn't it, like LC, precoordinate? Please don't let's suggest that the Web be cluttered up with nineteenth-century notions of subject access designed for catalogue cards. Simple postcoordinate terms are what is required, e.g. term 1, Libraries; term 2, United States; not "Libraries--United States" or whatever Sears has to offer. Search engines may not be perfect but they can do Boolean, even if it's often implicit rather than explicit.
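A minimal Python sketch of postcoordinate retrieval follows. It assumes a hypothetical in-memory index (`PAGE_TERMS`) in which each page carries simple single-concept terms, and it shows how the combination "Libraries" and "United States" is made at search time rather than being stored as one precoordinated heading. The page names and function names are illustrative only.

```python
# Hypothetical index: each page is tagged with simple, single-concept terms.
PAGE_TERMS = {
    "page-a": {"libraries", "united states"},
    "page-b": {"libraries", "canada"},
    "page-c": {"museums", "united states"},
}


def search_all(*terms):
    """Implicit Boolean AND: pages carrying every requested term."""
    wanted = {t.lower() for t in terms}
    return sorted(page for page, tags in PAGE_TERMS.items() if wanted <= tags)


def search_any(*terms):
    """Boolean OR: pages carrying at least one requested term."""
    wanted = {t.lower() for t in terms}
    return sorted(page for page, tags in PAGE_TERMS.items() if wanted & tags)
```

Here `search_all("Libraries", "United States")` finds only the page tagged with both terms, while the same two terms remain separately searchable.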
Given the undesirability of precoordination in subject indexing, I wonder whether there is a need for (5) Forms, taken from a short list of appropriate terms, e.g. information service, promotional material, images, sounds, software (form not subject, i.e., downloadable), news, directory, discussion forum ...
 
Eyler Coates
I regrettably must confess that I am unfamiliar with your frame of reference here and am uncertain of your meaning. I had suggested five subject headings as a maximum, though if the Keyword option were selected, it might seem that more than five would be required. Are we talking about the same thing?
 
Robert Cunnew
I was suggesting that we need to categorize Web pages by the form the information is in (e.g. "Images") as well as the subject of the information (e.g. "Hale-Bopp"). Systems like LCSH mix the two functions but if you're using postcoordinate indexing you really need to separate them.
 
Eyler Coates
Is it not true, however, that the keywords (which can also be phrases) need not be in a separate META category? As long as the terms designating the form, subject, or whatever, are distinctive indicators, could they not be intermixed? Thus, a Webmaster who uses the keywords "Marie Antoinette, biography" would, with two keywords, establish the work in three categories.
 
Michelle Robertson
I don't see how keyword language can be controlled to that extent successfully, without asking Webmasters to research their subject terminology and essentially do the same work catalogers do when assigning subject headings. It is very easy to add more terms as "see" or "see also" references; it will be impossible to divide up multiple meanings of the same word after it has been used by Webmasters to define their sites. Since the proposal seems contingent on their being able to use the vocabulary they choose, I think an attempt to separate terms after their assignment to a site would be ill-advised.
The webmaster for a site that consists of an extensive collection of American literature will want to use the "literature" keyword. The webmaster for a site that consists of information about American literature (historical, critical, whatever) will want to use the "literature" keyword as well. People who only want to find one of these two things will be enormously frustrated to have to wade through a large selection of both kinds of sites.
If you control the vocabulary to the point that people have to use unnatural words to get what they want (especially on the web), they will be discouraged from using that search strategy. The abovementioned metaliterature site could include "literary criticism" "literary history" "book reviews" "literature bibliography" etc., but this is putting an undue demand both on the site to provide all these specific terms that differentiate its content from literature itself, and on the searcher to come up with all the terms he needs. The literature/criticism problem is among a host of similar conflicts. We just don't use different words to distinguish between form and content in English, so any attempt to force the vocabulary to do so will be stilted and confuse the user.
A tongue-in-cheek proposal: we could just add "meta" to all the terms to describe "aboutness." "Metaliterature" for criticism and "metametaliterature" for a bibliography of literary criticism... and dare I mention "metametametaliterature" for a catalog record of the bibliography of critical works... these are just not attractive to look at, and they don't really mean anything.
Unless there is another proposal that would solve this problem, having different "META" tags for content and form is essential if any sense is to be made out of the keywords I've mentioned.
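The literature example can be sketched in Python as follows. This is only an illustration under stated assumptions: the two site records, the field names `form` and `subject`, and the functions are hypothetical, standing in for whatever separate tags a real scheme would define.

```python
# Hypothetical records, as if read from two separate META tags per page.
SITES = [
    # A collection OF American literature: the word describes its form.
    {"url": "collection.example", "form": {"american literature"}, "subject": set()},
    # A site ABOUT American literature: the word describes its subject.
    {"url": "criticism.example", "form": {"criticism"}, "subject": {"american literature"}},
]


def find(form=None, subject=None):
    """Search the form and subject fields independently."""
    results = []
    for site in SITES:
        if form is not None and form not in site["form"]:
            continue
        if subject is not None and subject not in site["subject"]:
            continue
        results.append(site["url"])
    return results


def find_flat(keyword):
    """Search a single undifferentiated keyword field, for comparison."""
    return [s["url"] for s in SITES if keyword in s["form"] | s["subject"]]
```

With one flat keyword field, a search for "american literature" returns both sites; with separate fields, the searcher can ask for the texts themselves or for writing about them without either side having to invent unnatural vocabulary.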
 
J. McRee Elrod
The subjects of web sites are even more up-to-date and various than those of books. My experience of Sears is that it is too limited even for books. I would suggest LCSH supplemented by the most recent annual cumulation of the appropriate Wilson periodical index.
Another option, of course, would be keywords.
 
Eyler Coates
Both Mac and Robert Cunnew address this important problem, though from slightly different points of view. Robert Cunnew suggests a rejection of both the Sears and LCSH in favor of subject headings that might permit a Boolean search, which I assume would mean the use of keywords. This is a crucial part of the system, and the point that Mac makes about the subjects of websites being "more up-to-date and various than those of books" is a salient one. If this general scheme were adopted, it seems that the "keyword" option might be the most desirable. It may be that a Search Engine could provide a standard list of Keywords created from those actually used by Webmasters, and that this list could serve as a reference list to maintain a level of uniformity. Webmasters could then create new Keywords if there were none on the list adequate for their needs. Thus, the Keyword List would be constantly brought up to date. Such a list would also be useful for user/researchers while browsing. In addition, a human being (cataloger) could monitor the list, creating appropriate "See Also" references as part of the Search Engine's offerings and perhaps even making redundant Keywords, created by Webmasters (who are necessarily amateur catalogers), all refer to the same items.
 
Ruth Lewis
This is a controlled vocabulary, rather than keywords, is it not? I always understood keywords to be taken from the text/web page unaltered. Whether pre- or post-coordinated, I think some kind of controlled vocabulary is the only way to get decent subject searching.
 
Eyler Coates
Keywords are contained in a standard META tag, which is in the "head" of a webpage. They are frequently used by Search Engines as a kind of subject heading and are, indeed, normally taken from the web page unaltered. If one Webmaster uses the keyword "airplane" and another uses the keyword "aircraft," a Search Engine right now would treat them as two entirely separate terms. What I am suggesting is that, under the system proposed here, the Search Engine's Cataloger would equate those two terms in the Search Engine's keyword vocabulary, so that even though two different Webmasters used two different terms, either term would refer to the same list of subject entries. And if a Surfer entered either term in his search, the Search Engine would display all entries using either term. Thus, in a sense, you are right: it ends up being a controlled vocabulary, even though it is entered by the Webmasters as an uncontrolled vocabulary. The control comes about because the Cataloger treats all those various terms as though they were the same in the Search Engine's vocabulary.
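The "airplane"/"aircraft" case might be sketched in Python like this. It is a minimal sketch, assuming the Cataloger's decisions are stored as groups of equivalent terms; the group list, the page names, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical equivalence groups maintained by the Search Engine's Cataloger.
EQUIVALENTS = [{"airplane", "aircraft"}]


def build_canonical_map(groups):
    """Map every term in a group to one shared internal key."""
    canonical = {}
    for group in groups:
        representative = min(group)  # arbitrary but stable; no term is "preferred"
        for term in group:
            canonical[term] = representative
    return canonical


CANONICAL = build_canonical_map(EQUIVALENTS)


def normalize(term):
    """Reduce a keyword to its internal key; unknown terms pass through."""
    term = term.strip().lower()
    return CANONICAL.get(term, term)


def build_index(pages):
    """Index pages under normalized keywords, whatever the Webmaster typed."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[normalize(keyword)].add(url)
    return index


def search(index, term):
    """Return every page filed under the term or any equivalent of it."""
    return sorted(index.get(normalize(term), set()))
```

The Webmasters' own keywords are never edited; only the lookup key is shared, so a search on either term returns the pages tagged with both.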
In this way, the equivalent of "See" references would be created by the Webmasters, but their combination into ONE real reference would be supervised and maintained by the professional. Interestingly enough, unlike the standard card catalog "See" references, these "See"-like references would all be considered equally authentic. This, then, would be a dynamic system that would be in a constant state of evolution.
One problem with this is that a list of equivalent keywords is not presently available on the Web, although it probably would not be too difficult to establish a list that Catalogers and Webmasters could use. If some public institution supplied such a list on the WWW, all search engines could use it, and everybody would benefit from the uniformity. Such a list would serve as a guide to the Cataloger and a suggestion to the Webmaster. But even if the Webmasters never referred to the list, the system would function perfectly well as long as the Cataloger stayed on top of the terms that were actually being used and converted them into their equivalent.
Note that the Search Engine itself would contain the equivalent of "See" and "See Also" references. "See" references would automatically bring up the referenced materials in most cases, whereas "See Also" references should present options to the user. In some cases, a search could bring up a request for more specific attributes from the user. All of these would be elements built into the search engine's program.
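The difference between the two kinds of reference could be sketched in Python as below: a "See" reference is followed automatically, while "See Also" references are only returned as options for the user. The three tables and their contents are hypothetical and exist only to illustrate the behavior described.

```python
# Hypothetical reference tables that the Cataloger would maintain.
SEE = {"airplane": "aircraft"}                      # followed automatically
SEE_ALSO = {"aircraft": ["airports", "aviation"]}   # offered as options
INDEX = {"aircraft": ["a.example", "b.example"], "aviation": ["c.example"]}


def lookup(term):
    """Resolve a "See" reference silently and attach any "See Also" options."""
    term = term.strip().lower()
    heading = SEE.get(term, term)
    return {
        "heading": heading,
        "results": INDEX.get(heading, []),
        "see_also": SEE_ALSO.get(heading, []),
    }
```

A search on "Airplane" would bring up the entries filed under "aircraft" directly and present "airports" and "aviation" as further choices rather than merging them into the results.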
 
Steve Shadle
And can anyone out there tell me what the current status of the Dublin Core (and other metadata schemes) is in terms of development, establishment and actual acceptance in the community?
 
Hal Cain
The latest I can put my hand (or mouse pointer) on is at:
http://www.oclc.org:5046/research/dublin_core/ but there must be other material too.
 
Rebecca S. Guenther
See the paper submitted to the USMARC Advisory Group at Midwinter 1997:
"Discussion Paper No. 99: Metadata, Dublin Core, and USMARC: a review of current efforts"
gopher://marvel.loc.gov/00/.listarch/usmarc/dp99.doc
There was another workshop in Canberra, Australia, in March 1997 that worked further on refinement of the data elements and syntax for embedding META tags in HTML.
 
Eyler Coates
The Dublin Core has devised a system with 15 different META tags, as I recall. It would require the entire system to be administered by trained Catalogers, and for that reason I doubt if it would address the problem of explosive growth as we are trying to do here.
 
Roman S. Panchyshyn
There are three distinct approaches that I have encountered in considering this question. First, there is the Intercat project at OCLC and the emergence of the 856 MARC field, which has allowed cataloguers to provide access to electronic information via MARC records. While this works fairly well, first generation automation systems usually do not have web-based OPACs which allow for "hot links" to the items themselves. The second and third approaches involve other forms of metadata. There is the approach which uses Text Encoding Initiative (TEI) headers for electronic documents which have been marked up using SGML. The Library of Congress has been working on software which would allow for conversion of TEI headers to MARC and vice-versa. They are probably better suited to comment on the current status of this project than I am. Third is the Dublin Core approach. This approach would see creators of electronic documents embed a core set of elements in the meta tags of HTML documents. These elements, as has already been pointed out on this list, have also been mapped to USMARC in a recent MARBI DP. For more information on the DC, I would suggest readers contact Stuart Weibel at OCLC. He is one of the major developers of this initiative.
My opinions on this issue are very basic ones. If no standards for cataloguing and providing access to these materials are decided upon, we may be faced with an enormous "virtual backlog" of Internet resources. If cataloguers fail to provide access to these materials, then we are not providing the best service possible to our users.
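As a rough illustration of the third approach, the Python sketch below turns a small set of descriptive elements into META tags for the head of an HTML document. It assumes the "DC." prefix naming convention for element names; the element names and values shown are examples only, and the function name is hypothetical.

```python
from html import escape


def dc_meta_tags(elements):
    """Render a mapping of element name to value as HTML META tags."""
    lines = []
    for name, value in elements.items():
        # Escape quotes and angle brackets so the attribute stays well-formed.
        lines.append(
            '<meta name="DC.{}" content="{}">'.format(
                escape(name, quote=True), escape(value, quote=True)
            )
        )
    return "\n".join(lines)
```

Calling `dc_meta_tags({"title": "Cataloging the Web", "subject": "Cataloging"})` yields one tag per element, which the document's creator would paste into the page head.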
 
J. McRee Elrod
Perhaps a fourth could be added to these: bookmarks on a library homepage. Most I have seen segregate bookmarks into those intended for patrons, and those intended for staff (mainly technical services staff).
Why not have library homepages for each subject division, e.g., social sciences, humanities, and sciences, with relevant bookmarks?
The major difficulty I have with creating MARC records for many of these sources is that they are in such a state of flux. The bookmark list(s) could be seen as analogous to vertical or pamphlet files, where we put material of current interest which is too ephemeral for full cataloguing.
 

All rights reserved. Each contributor to these pages retains the copyright to their own statements, and all quotations therefrom must be attributed to the contributor, not to the editor or any other entity.
