Cataloging the Web

Making the WWW More Accessible

3. Problems & Applications
 
Problems With Keywords
 
Michelle Robertson
What concerns me most about the keyword approach is that, the way the web is growing, simply having "subjects" and even "forms" as types of keywords to identify web sites will become grossly insufficient. If web sites ever begin a trend towards higher degrees of specificity, I think we'll end up with a similar problem to the current one.
 
Eyler Coates
In fact, isn't that what we have now? The tendency seems to be toward websites with very narrowly defined content. Those covering very broad subjects (e.g., "Philosophy") seem more often to be collections of URLs for other sites.
 
Michelle Robertson
Take this example: We find a web site that addresses "The Effect of Man on the Environment." The webmaster has assigned the subjects "man" and "environment." But how can we differentiate between this website and the ones on "The Effects of the Environment on Man"?
 
Eyler Coates
For one thing, there is the title itself. But yours is an excellent point: although a perusal of entries that a search engine might pull up would allow a person to pick out what was wanted by examining titles and abstracts, the idea of an improved system is to narrow the search sufficiently so that there is not so much irrelevant material to sift through. That's what we have now. Indeed, your point focuses on the very heart of the problem of improving access.
 
Michelle Robertson
It seems to me that it would be best to have a variety of possible types of "subject" entries. An "effect of" subject-tag would be very useful in this situation. Of course, most sites wouldn't use it, but to some it would be essential for proper identification. The result for the first site would be something like <subject = environment> and <effect of = man>.
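As an illustration of the point above, here is a minimal sketch of how typed subject fields could separate the two sites where flat keywords cannot. The field names `subject` and `effect_of` follow the tags proposed in the post; the record layout and both search functions are hypothetical and not part of any existing search engine.

```python
# Two hypothetical site records. Flat keywords cannot tell them apart,
# but typed fields ("subject" plus "effect_of") can.
SITES = [
    {
        "title": "The Effect of Man on the Environment",
        "keywords": {"man", "environment"},
        "fields": {"subject": "environment", "effect_of": "man"},
    },
    {
        "title": "The Effects of the Environment on Man",
        "keywords": {"man", "environment"},
        "fields": {"subject": "man", "effect_of": "environment"},
    },
]


def search_keywords(sites, *words):
    """Return titles of sites whose flat keyword set contains every word."""
    wanted = set(words)
    return [s["title"] for s in sites if wanted <= s["keywords"]]


def search_fields(sites, **criteria):
    """Return titles of sites whose typed fields match every criterion."""
    return [
        s["title"]
        for s in sites
        if all(s["fields"].get(k) == v for k, v in criteria.items())
    ]


# The keyword search returns both sites; the typed search returns one.
flat_hits = search_keywords(SITES, "man", "environment")
typed_hits = search_fields(SITES, subject="environment", effect_of="man")
```

The typed query only helps if the webmaster filled in both fields, which is the dependency discussed in the replies that follow.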
 
Eyler Coates
As this whole discussion thread develops, it begins to become apparent that there will need to be two approaches to "cataloging the web," which are roughly analogous to Books in Print on one hand, and a standard library catalog on the other. This means a rather "rough" access to the entire World Wide Web for persons wanting the broadest possible approach to all the material on the Web, and a more focused and selective approach to those "materials that have (at some level) been evaluated as useful for our users or germane to our missions," in the words of Steve Shadle. Cataloging for the former will mean something that will be confusing, perhaps, without closer examination of the Website itself; cataloging for library purposes will be something with greater specificity, having all the elements you identify clearly spelled out. We might hope that both types of cataloging will be complementary.
One important difference not to be overlooked in the analogy between BIP and libraries on one hand, and the WWW and its Search Engines on the other, is that far more ordinary citizens will be using and dependent on Search Engines than on BIP. Adequate access, therefore, becomes a vital issue.
In considering cataloging for the entire WWW, I think the important thing for us to do is to put ourselves in the mind of both the Surfer and the Webmaster. It is not likely that they will look up a proper term in a list, though it is more likely in the case of the Webmaster. In truth, the topics you use as an example are broad ones. But assuming that the Webmaster has a page that falls within, say, "The Effect of Man on the Environment," I believe we are going to need to be dependent on however the Webmaster describes the Website and just face philosophically the fact that the descriptive tags will be, ultimately, inadequate.
 
Michelle Robertson
A "form" example (off the top of my head): You come across a site dedicated to Marie Antoinette's fictional appearances in literature. How do you differentiate between this site and one that discusses her as a historical character? According to Library of Congress practice, her name would be followed by "in literature." But if you simply provide a subject for her name and for "literature" on the web, that sounds like it might be a site that presents literature written by her, which is very misleading.
With the "form" heading/tag the last example would be solved. Sites that contain literature could have the tag <form = literature>. Sites that are about literature would have the tag <subject = literature>. Sites that focus on both would have both. If there is literature by Marie Antoinette at the site, there could be a tag <author = Marie Antoinette>. This would need to be distinguished from the tag for the author of the web page, though... <grin>
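The distinction between "form" and "subject" can be sketched in a few lines. The code below reads the informal `<field = value>` notation used in this discussion (which is not real HTML syntax) and reports whether a site contains literature, discusses it, or both. The pattern, function names, and sample tag strings are assumptions made for illustration.

```python
import re

# Matches the informal "<field = value>" notation used in this discussion.
# It is not real HTML syntax; the pattern is an assumption for illustration.
TAG_PATTERN = re.compile(r"<\s*([^=<>]+?)\s*=\s*([^<>]+?)\s*>")


def parse_tags(text):
    """Collect (field, value) pairs into a dict of field -> list of values."""
    tags = {}
    for field, value in TAG_PATTERN.findall(text):
        tags.setdefault(field.lower(), []).append(value)
    return tags


def describe(tags):
    """Say whether a site contains literature, is about it, or both."""
    contains = "literature" in tags.get("form", [])
    about = "literature" in tags.get("subject", [])
    if contains and about:
        return "contains and discusses literature"
    if contains:
        return "contains literature"
    if about:
        return "discusses literature"
    return "unrelated to literature"


# Hypothetical site: criticism about Marie Antoinette as a literary figure.
criticism = parse_tags("<subject = Marie Antoinette> <subject = literature>")
# Hypothetical site: texts written by her, so "author" and "form" apply.
writings = parse_tags("<author = Marie Antoinette> <form = literature>")
```

Keeping `author` as its own field leaves room for a separate field naming the author of the web page itself, which the post notes would need to be distinguished.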
 
Eyler Coates
I believe your examples ably illustrate the need for expert cataloging for the best, library-standard results. They also demonstrate that this level of cataloging will be beyond what we could expect from the vast majority of Webmasters. Hence, the necessity of the two complementary approaches.
 
Michelle Robertson
I think that keywords are definitely the way to go on the web, for simplicity's sake. But I am sure that users would benefit from greater specificity within the keyword framework. If the subject tags are well-documented, webmasters should have plenty of incentives to use them.
 
Getting Started
 
Eyler Coates
The first requirement to initiate the system would be for some enterprising Search Engine to offer searches based on the META tags described above. This need not replace the present word searches, and it would permit the system to be introduced gradually.
Then what would be required is the cooperation of Webmasters in providing the necessary META tags. Would they do it? Of course they would do it! Right now, many of them try all kinds of tricks in order to get their page recognized, such as filling the page with certain key words. The one thing that Webmasters want after going to all the trouble of creating a Webpage is for everyone to have access to it. Surely, if they were required to put in the appropriate META tags to get listed properly, you can bet they would do it.
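What a search engine would need in order to offer such searches is, at minimum, a way to read the META tags out of each page. The following is a rough sketch using Python's standard `html.parser` module. The `subject` and `form` tag names follow this discussion and are not part of any standard; the sample page is hypothetical.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            # Tag names are lower-cased so "Subject" and "subject" match.
            self.meta[name.lower()] = content


def extract_meta(html):
    """Return a dict of META tag name -> content for one page."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return parser.meta


# Hypothetical page; the "subject" and "form" names follow this discussion
# and are not part of any standard.
PAGE = """<html><head>
<title>Marie Antoinette in Fiction</title>
<meta name="subject" content="Marie Antoinette">
<meta name="form" content="bibliography">
</head><body><p>Page text the engine would not need to index.</p></body></html>"""

page_meta = extract_meta(PAGE)
```

Because the extractor ignores everything except META tags, it could run alongside an ordinary word index, which fits the gradual introduction suggested above.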
 
Joel Hahn
One problem: As you yourself say, "Right now, many of them try all kinds of tricks in order to get their page recognized, such as filling the page with certain key words."
What's to stop these same webmasters from dumping certain key words in the META tags as well, whether or not the words actually describe the page?
Answer: Nothing.
 
Eyler Coates
Not at all. There are still things like honor and honesty. A decent person discovers early on that his interests are not furthered by trickery and deceit. Those who use such methods are discredited and receive the opprobrium of the very people they are trying to attract. But even more than moral disincentives, it would be a simple matter for a Search Engine to have a button for reporting such dishonesty by users, and for the Search Engine, after a simple investigation, to remove the offending Website from its files. There would be nothing to be gained by indulging in such intellectual vandalism.
 
Joel Hahn
Result: Just as many worthless hits.
 
Eyler Coates
Not very likely. Dishonesty seldom pays in the long run. It is true that there would probably be a small number of mischievous kids that might try such tricks in spite of the consequences. But since adequate protections could easily be in place for dealing with such people, why let immature or lawless elements prevent what is otherwise reasonable policy?
 
Joel Hahn
Your idea is nice in theory, but it falls apart due to the chaos of the web and the dishonesty of many webmasters that will try just about anything in order to get "everyone to have access to [their page]."
 
Eyler Coates
It should be noted especially that those webmasters who now try to load their sites with key terms do so not to deceive people into going to a site unrelated to their quest. They do it to overcome the insane system we now have, which "catalogs" sites based on the number of times certain words appear on a page of text. Moreover, even now, some Search Engines can detect this tactic and reject such pages. Webmasters are not more dishonest than other people; they are just seeking effective means for promoting their pages.
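How a search engine might detect pages loaded with repeated key terms is not described in this thread. One simple possibility, offered only as a guess at the general idea, is to flag a page when a single word makes up an unusually large share of its text. The function names and the 0.3 threshold below are arbitrary assumptions.

```python
import re
from collections import Counter


def top_word_share(text):
    """Return the most frequent word and the fraction of all words it takes."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None, 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


def looks_stuffed(text, threshold=0.3):
    """Flag a page when one word dominates the text.

    The 0.3 threshold is an arbitrary value chosen for illustration.
    """
    _, share = top_word_share(text)
    return share > threshold


# Hypothetical page texts: ordinary prose versus one key word repeated.
normal_page = "A short history of the French court and its queen in fiction."
stuffed_page = "travel " * 40 + "cheap flights and hotels"
```

A practical check would at least ignore common words such as "the" and would need tuning against real pages; this sketch does neither.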
 
Joel Hahn
Call me a pessimist, but self-moderation only works as long as everybody wants it to work as it should--and it has already been demonstrated that many of the "interested" folk are not interested in seeing it work as it should.
 
Eyler Coates
Nothing works 100% of the time. Even trained catalogers can have their biases that pop up occasionally. A system will work as long as it is in the vital interests of participants that it does work, and as long as there are adequate measures for dealing with those rogues who are intent on disrupting any system, anywhere. Webmasters who will go to all the trouble to establish a website and then promote it with false advertising are fools. They could only hope to gain the anger and ostracism of those who might visit their site. It is also well to remember that the people who pull pranks on the Internet almost always do so anonymously. They are not, as a rule, interested in advertising to the world that they are knaves.
 
Joel Hahn
Solution: Get a relatively impartial group of people (or a really good fuzzy logic program) to create these fields and store them in a central database. While we're at it, let's call the group of people "catalogers," the data sets "bibliographic records" and the database a "catalog." Also, either use a controlled keyword vocabulary or set up one heck of a SEE ALSO references list.
Solution 2: Start with a core of interested personnel. Tell them what to look for, and monitor their work until they can be reasonably trusted to be impartial/truthful. Create a pyramid, where each person spends some time checking the input of the newest personnel until those people can go on their own and check more people, etc., etc.--much in the same way the Internet Oracle works--rather than relying only on degreed catalogers, for example. Also, either use a controlled keyword vocabulary or set up one heck of a SEE ALSO references list, and make sure the personnel.
 
Eyler Coates
Your solutions are far more complicated and require too many expert personnel. If the WWW explosion is only at its beginning, such high levels of expert control will become increasingly unable to deal with the volume.
Search engines that rely on just the META tags will be easier to set up. Since a search engine would be concerned only with the data in a page's heading, this would require less storage and might even enable a search engine to visit every page on the Web, and do it more frequently.
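The storage argument can be made concrete with a small sketch: keep only the part of each page up to the closing `</head>` tag and compare sizes. The function name and the sample page are hypothetical, and the size of the saving depends entirely on the page.

```python
import re

# Closing head tag, matched case-insensitively because older pages
# often wrote tags in upper case.
HEAD_END = re.compile(r"</head\s*>", re.IGNORECASE)


def head_only(html):
    """Return the document up to and including </head>, or all of it if absent."""
    match = HEAD_END.search(html)
    if match is None:
        return html
    return html[: match.end()]


# Hypothetical page with a small heading and a large body.
LONG_PAGE = (
    "<html><head><title>Example</title>"
    '<meta name="subject" content="environment"></head>'
    "<body>" + "<p>Lots of page text.</p>" * 200 + "</body></html>"
)

stored = head_only(LONG_PAGE)
# Fraction of the page that would not need to be stored.
saving = 1 - len(stored) / len(LONG_PAGE)
```

A crawler working this way could also stop downloading once it has seen the closing head tag, though that step is not shown here.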
 
Specific Applications
 
Gordon Pew
Since I am now a law cataloger, I am interested specifically in what other law libraries are doing on this front (if anything). First, one thread of the discussion is on metadata. Does this imply that the original poster on this issue wanted to embed better internal indexing in Web documents? I thought the issue was whether we catalogers should create bibliographic records, resident in our OPACs, for documents available on the WWW.
Are any law libraries beginning to catalog the Web in any way? Who in the library decides which Web documents should be cataloged (with an example or two)? How do/would you monitor Web resources to be sure they haven't been substantially changed or even deleted from the Internet? I am not so much interested in the mechanics (classification, description, subject analysis) of cataloging Internet resources: rather, the rationale for doing it in the first place.
 
Vianne Sha
Our library has been cataloging Internet resources since we joined the OCLC InterCat project in 1995. We use OCLC to catalog Internet resources just as we do any other resources. I believe some libraries have a collection development committee to decide what to catalog. I have drafted a selection policy for my library, with the other librarians' consensus, and decide what to catalog according to those guidelines. A URL checking program is used to monitor changes to the URLs of the resources we have cataloged. Sometimes human review is necessary to verify content changes. OCLC's PURL is a good way to avoid changing the URL in OCLC and in the local catalog, even though you still need to change the URL in the PURL server. If you want to know more about what I propose to solve this problem in the library field, please attend the B-6 program at the American Association of Law Libraries Annual Meeting in Baltimore.
My rationale for cataloging Internet resources is as follows:

1) it integrates all formats of information resources under one search engine (the library OPAC) for our local users;
2) Internet resources are information that is no different from other information resources such as books and microfilms;
3) we can never catalog all the books in the world, but we are still selecting and cataloging them.

The same theory applies to Internet resources. I don't think any system can organize all Internet resources in the world right now, but we are still trying to organize them and make good use of them.
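The URL checking and PURL ideas mentioned in this post can be sketched roughly as follows. This is not the program the library uses, and it does not reflect OCLC's actual PURL server interface; the record layout, function names, and example URLs are all hypothetical. The status lookup is passed in as a parameter so the checker can be tried without network access.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        return None


def find_broken(records, get_status=http_status):
    """Return the catalog records whose URL no longer answers with 200."""
    return [r for r in records if get_status(r["url"]) != 200]


def resolve(purl_table, purl):
    """Look up the current target of a persistent URL (a PURL-style redirect)."""
    return purl_table.get(purl)


# Hypothetical catalog records and a stand-in for live HTTP checks.
CATALOG = [
    {"title": "Court opinions", "url": "http://example.org/opinions"},
    {"title": "Old statute index", "url": "http://example.org/statutes"},
]
FAKE_STATUS = {
    "http://example.org/opinions": 200,
    "http://example.org/statutes": 404,
}
broken = find_broken(CATALOG, get_status=FAKE_STATUS.get)

# With a PURL, the catalog stores the persistent name; only this table
# changes when the resource moves.
PURL_TABLE = {"purl/statutes": "http://example.org/law/statutes"}
current_target = resolve(PURL_TABLE, "purl/statutes")
```

A status check of this kind only shows that a URL still answers. As the post notes, deciding whether the content has changed substantially still calls for human review.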
 

All rights reserved. Each contributor to these pages retains the copyright to their own statements, and all quotations therefrom must be attributed to the contributor, not to the editor or any other entity.
