Cataloging the Web

Making the WWW More Accessible

4. Conclusion: A System for the Web
 
Charley Pennell
While web resource-embedded metadata is fine for fairly unique items which people search using known access points, it is not terribly useful for subject searches, partly due to the lack of a controlled vocabulary in Dublin Core, TEI headers, etc. and partly due to the web search engines' lack of any collocating function between documents such as is provided by classification or thesauri. Because metadata is not uniformly available in all net documents, the search engines still seek relevant terms from throughout the document, thus further diluting the provided metadata's usefulness.
 
Eyler Coates
The system as proposed in this thread is not intended to rely upon metadata as it now exists. What we are suggesting is a system that could be implemented gradually, and that would furnish improved access as its acceptance became more widespread. Right now, there exists something approaching chaos. Indexing every term in a document is not really a system; at most it might be called a default system, a catch-all that provides "best we can do" access in the absence of a deliberate and rationally organized one. Already, in these exchanges, we have stumbled on a systematic way of employing an UNcontrolled vocabulary that nonetheless produces controlled results. This is done by having unskilled Webmasters supply an array of subject headings for the same subject, with a professional librarian at the Search Engine who takes these user-supplied "See" references (and what better source for See references than the unskilled people who will be using the search engine?) and makes all the various terms that refer to one subject category in fact refer to that one category. Thus, you would have a single subject category, but multiple terms that give access to it. This might sound offhand like another kind of chaos, but what it does is convert the chaos that exists in the user's approach to searching into an organized referral. If a group of Webmasters each used a different Keyword for the same subject, the cataloger's job would be to see that the Search Engine referred a search for any one of those terms to ALL the documents which contained any one of those keywords in their metadata.
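
The referral scheme described above can be sketched in a few lines of Python. This is an illustration only, not something the contributors specified: the "See" reference table, the example URLs, and the function names (category_of, build_index, search) are all hypothetical, and a real search engine would need far more than a dictionary lookup. The sketch assumes each page carries a list of Webmaster-supplied keywords and that a cataloger maintains the table mapping variant terms to one subject category.

```python
from collections import defaultdict

# Hypothetical "See" references maintained by a cataloger at the search
# engine: every variant term points to one subject category.
SEE_REFERENCES = {
    "autos": "automobiles",
    "cars": "automobiles",
    "motor vehicles": "automobiles",
    "automobiles": "automobiles",
}

# Hypothetical pages with the keywords their Webmasters supplied.
DOCUMENTS = {
    "http://example.org/a": ["cars", "repair"],
    "http://example.org/b": ["autos"],
    "http://example.org/c": ["gardening"],
}


def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def category_of(term):
    """Return the subject category for a term.

    A term with no "See" reference is treated as its own category.
    """
    normalized = normalize(term)
    return SEE_REFERENCES.get(normalized, normalized)


def build_index(documents):
    """Map each subject category to the set of pages filed under it."""
    index = defaultdict(set)
    for url, keywords in documents.items():
        for keyword in keywords:
            index[category_of(keyword)].add(url)
    return index


def search(index, term):
    """Refer a search on any variant term to all pages in its category."""
    return sorted(index.get(category_of(term), set()))


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    # A search on one variant finds pages tagged with the other variants.
    print(search(index, "Motor Vehicles"))
```

In this sketch the vocabulary stays uncontrolled on the Webmaster side; the control lives entirely in the table the cataloger maintains, which is the division of labor the paragraph describes.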
 
Charley Pennell
Letting useful resources languish out on the network is not the solution for our patrons in any case. It is the library electronic catalogue that combines our expertise in selection, retrieval and, increasingly, delivery of information to our clientele. Getting this data into a form which cataloguers could use might best be achieved by an electronic CIP arrangement with information providers, who would contact us (depending on the constituency of the provider) to see about getting machine-readable copy attached to worthy network resources. For an example of how this might work, see:

http://www.statcan.ca/english/Dsp/81-003-XPB/81-003-XPB.htm
 
Eyler Coates
Unfortunately, the WWW is too big (and growing), too wild, and too mutable to be limited to the facilities appropriate for a former age. We are talking about an explosion, and we are just now at the beginning of it. Materials are being produced and made available so easily and so cheaply that no system relying on the passage of such an enormous flow of materials through the hands of professionals, working on them one at a time, will meet the challenge of this new age. At best, such a system would always mean that only a small portion of the available materials would be processed, thus failing to provide technological advances in cataloging to correspond with the technological advances in data production.
 
Steve Shadle
But currently we provide access (through library and other bibliographic information services) to only a portion of available materials, and these are materials that have (at some level) been evaluated as useful for our users or germane to our missions. (Contrary to popular belief, the Library of Congress does not contain every book ever published. ;-)
I wholeheartedly agree that user-supplied metadata can bring order to the Internet universe. But if the Web is growing as exponentially as presumed, then it's even more important that we lay the groundwork to enable a person to:

find a work by a given author (Where's my John Smith? Who are Carr/Holt/Kellow/Plaidy/Tate? Which one is Bill Clinton: William E., William J. or William R.?)
identify the intellectual work (vs. the manifestation) (Hamlet, the Apocrypha or Beethoven's Eroica by any other name)
provide information about the bibliographic relations between works (editions, revisions)
identify the genre/format of the work (web site vs. text document; review article vs. research report).

Our current catalogs attempt these functions with some degree of success. For those items which are significant to our users, shouldn't we provide this same level of identification/control? I'm not saying we need to do it in the same way. I think the Dublin Core comes closer to providing the elements necessary for us to provide this level of description than Eyler's suggested metadata structure (although his elements can serve as a basis for basic description; the Dublin Core is expansible after all).
I applaud Eyler's suggestions for metadata subject authority and think the use of user-supplied data for the Web universe is better than we've ever been able to do in print. But I can't help but think that there's more to it than just subjects and that we can still provide a selection (not everything that's submitted to Yahoo gets in) and identification role as we do with print resources.
 
Eyler Coates
Steve's response illustrates an interesting difference of perspective. Although a former librarian (having served time as a cataloger) and now retired, I am right now above all a Webmaster and Internet user. The Web exists in my view as an entity unto itself, further apart from libraries than are books and publishing. In another sense, it is like a separate genre from books, such as periodicals, music recordings, and film. Including Web resources in a library main catalog strikes me, offhand, as strange; something like including journal articles.
Cataloging the Web for itself might be considered similar to the "cataloging" in Books in Print. Would libraries include a complete catalog of every book in print if, through some mechanism, every one of them were available on a TV screen for library users? Probably not. In fact, maybe that is a good analogy. Would a library catalog then be a selected guide? Would it exist alongside this other, perhaps less detailed, catalog that provides access to *everything*, just as it now exists alongside BIP and CBI?
In that case, it is perfectly reasonable to assume that a library might selectively include certain unique Web resources in its catalog, just as Steve suggests, since they are so much the equivalent of a book. Indeed, I consider my main website, Thomas Jefferson on Politics & Government (forgive the plug ;-), as something that probably should exist in book form. There is no print book available that has assembled together as extensive a collection of Jefferson's ideas on politics and government, in his own words, as that website. Isn't it reasonable that libraries might want to include that type of resource in their catalog?
Therefore, what we end up with is a dual set of needs. There is a need for an adequate access tool to the entire Web, just as there is a need for periodical indexes or BIP. There is also a need for an amplified access tool to certain Web resources that would be appropriate for libraries and their computerized catalogs. Ideally, the latter should build upon the former, rather than being an entirely separate approach. The analogy with BIP breaks down here, however, because the Web catalog is not simply a list. The resources are readily at hand. Researchers will want to access the whole, big, messy thing, and they will want an access tool that is the best that is reasonably available, given the level of technology.
Moreover, the whole computer thing introduces new elements: besides new ways of storing the data, there are new ways of accessing the data, i.e., of forming searches. If libraries are tied to the "pre-coordinate" subject heading system, however, this might pose real problems. If anything has emerged from the present discussion, it is that LCSH and DDC are not right for the entire Web. The access tool suitable for the select Web will not be practicable for the entire Web, just as it would be impracticable to include complete cataloging data for every item in BIP. The Web promises to be a giant resource in the future, deserving of a cataloging system that would be adequate for its nature. That is one job. Libraries may need to provide selective, detailed access to Web resources. That is another job. But the two ought to be related in some rational way.
 
Kate Bowers
Let us remember that we are experts. We have become very skilled at deriving meaningful data from efficient inspection of an item. The records we create adhere not only to rigid standards of encoding, but also to intelligent rules for content. International standards for library cataloging have been in place since 1908. Professional ethics require that catalogers remain impartial. Catalogers follow and create standard lists. We contribute to and use authority files to ensure that the millions of John Williams in the world who create works have separate name entries.
This is only scratching the surface of the enterprise of cataloging.
In a situation where non-expert persons create records, they do many things unlike library cataloging. Lack of standard authority files will mean that common names will remain undifferentiated. Much as we lament the difficulty of remaining unbiased, we at least recognize the problem, and see more ideas and works from more corners of the world than most people. Lack of this broad overview of content will mean that basic assumptions, let alone biases, will be missed. For instance, a hate group with a Web site is unlikely to note "hate" or "racism" in its keywords; they will not see themselves this way. In a more innocuous example, and one that really exists on paper, picture a work with the word "Constitution" in large letters at its top. Which constitution is it? Is it the US constitution? The Malawi constitution? A translation of a defunct Belorussian constitution? Or the Constitution of the Ladies' Auxiliary of the Youngstown, Ohio Knights of Columbus? The work itself may not give enough information to tell. Its creators know who they are, and may not realize that the link between their identity and the document in their hands is apparent only to them; the outside world cannot know it.
This is only scratching the surface of the enterprise of organizing the web.

All rights reserved. Each contributor to these pages retains the copyright to their own statements, and all quotations therefrom must be attributed to the contributor, not to the editor or any other entity.
