Making the WWW More Accessible
Recent Postings
Subject: Re: Cataloging the Web
Date: Mon, 28 Apr 1997 19:54:51 -0700
From: "Eyler Coates, Sr." eyler.coates@worldnet.att.net
Organization: http://www.webspawner.com/users/EylerCoates/
Newsgroups: bit.listserv.autocat
Charley Pennell wrote:
>
> Folks-
>
>   While web resource-embedded metadata is fine for fairly unique items
> which people search using known access points, it is not terribly useful
> for subject searches, partly due to the lack of a controlled vocabulary
> in Dublin Core, TEI headers, etc. and partly due to the web search
> engines' lack of any collocating function between documents such as is
> provided by classification or thesauri. Because metadata is not
> uniformly available in all net documents, the search engines still seek
> relevant terms from throughout the document, thus further diluting the
> provided metadata's usefulness.
The system as proposed in this thread is not intended to rely upon
metadata *as it now exists*.  What we are suggesting is a system that
could gradually be implemented, and that would furnish improved access as
its acceptance became more widespread.  Right now, there exists
something approaching chaos.  Indexing every term in a document is not
really a system; at best it is a default, a catch-all that provides
"best we can do" access in the absence of a deliberate and rationally
organized system.
Already, in these exchanges we have stumbled on a systematic way of
employing an UNcontrolled vocabulary yet producing controlled results.
This is done by having unskilled Webmasters supply an array of subject
headings for the same subject, with a professional librarian at the
Search Engine who takes these user-supplied "See" references (and what
better source for See references than the unskilled people who will be
using the search engine?) and makes all the various terms that refer to
one subject category in fact refer to that one category.  Thus, you would
have a single subject category, but a multiple number of terms giving
access to that category.  This might sound offhand like another kind of
chaos, but what it does is convert the chaos that exists in the user's
approach to searching into an organized referral.  If a group of
Webmasters each used a different Keyword for the same subject, the
cataloger's job would be to see that the Search Engine referred a search
for any one of those terms to ALL the documents that contained any one of
those keywords in their metadata.
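As a rough illustration only (it is not part of anyone's proposal in this
thread), here is a minimal Python sketch of how such a referral table
might work at the Search Engine.  The terms, pages, and function names
are all invented for the example:

    # Webmaster-supplied keywords, grouped by a cataloger into one category.
    # Any term in a group acts as a "See" reference to the whole group.
    synonym_groups = [
        {"automobiles", "cars", "motor vehicles"},
    ]

    # Pages and the keywords their Webmasters happened to choose.
    documents = {
        "page1.html": {"cars"},
        "page2.html": {"automobiles", "engines"},
        "page3.html": {"gardening"},
    }

    def search(term):
        """Return every page tagged with ANY keyword in the term's group."""
        group = next((g for g in synonym_groups if term in g), {term})
        return sorted(doc for doc, kws in documents.items() if kws & group)

    print(search("cars"))   # ['page1.html', 'page2.html']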
>   Letting useful resources languish out on the network is not the
> solution for our patrons in any case. It is the library electronic
> catalogue that combines our expertise in selection, retrieval and
> increasingly, delivery, of information to our clientele.
>
> Getting this
> data into a form which cataloguers could use might best be achieved by
> an electronic CIP arrangement with information providers who would
> contact us (depending on the constituency of the provider) to see about
> getting machine readable copy attached to worthy network resources.  For
> an example of how this might work, see:
>
>         http://www.statcan.ca/english/Dsp/81-003-XPB/81-003-XPB.htm
Unfortunately, the WWW is too big (and growing), too wild, and too
mutable to be limited to the facilities appropriate for a former age.  We
are talking about an explosion, and we are just now at the beginning of
it.  We have materials being produced and made available so easily and so
cheaply that no system that relies on the passage of such an enormous flow of
materials through the hands of professionals working on them one at a
time will meet the challenge of this new age.  At best, such a system
would always mean that only a small portion of the available materials
would be processed, thus failing to provide technological advances in
cataloging to correspond with the technological advances in data
production.
Eyler Coates
--
=============================================================
All of the postings to this thread are available in a redacted
form, without repetitions and irrelevant matter, at:
                     Cataloging the Web
                Making the WWW More Accessible
   http://www.geocities.com/Athens/Forum/1683/cwindex.htm
==============================================================
Subject: Re: Cataloging the Web
Date: Mon, 28 Apr 97 14:20:38 +0000
From: Charley Pennell cpennell@morgan.ucs.mun.ca
Reply-To: cpennell@morgan.ucs.mun.ca
Organization: QEII Library, Memorial University of Newfoundland
To: eyler.coates@worldnet.att.net
References: <199704280425.BAA16478@piva.ucs.mun.ca>
Folks-
  While web resource-embedded metadata is fine for fairly unique items
which people search using known access points, it is not terribly useful
for subject searches, partly due to the lack of a controlled vocabulary
in Dublin Core, TEI headers, etc. and partly due to the web search
engines' lack of any collocating function between documents such as is
provided by classification or thesauri. Because metadata is not
uniformly available in all net documents, the search engines still seek
relevant terms from throughout the document, thus further diluting the
provided metadata's usefulness.
  Letting useful resources languish out on the network is not the
solution for our patrons in any case. It is the library electronic
catalogue that combines our expertise in selection, retrieval and
increasingly, delivery, of information to our clientele.  Getting this
data into a form which cataloguers could use might best be achieved by
an electronic CIP arrangement with information providers who would
contact us (depending on the constituency of the provider) to see about
getting machine readable copy attached to worthy network resources.  For
an example of how this might work, see:
        http://www.statcan.ca/english/Dsp/81-003-XPB/81-003-XPB.htm
_______________________________________________________________________
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Charley Pennell                              cpennell@morgan.ucs.mun.ca
Head, Cataloguing Division                         voice: (709)737-7625
Queen Elizabeth II Library                           fax: (709)737-3118
Memorial University of Newfoundland
St. John's, NF  Canada   A1B 3Y1
World Wide Web: http://sicbuddy.library.mun.ca/~charl8P9/chuckhome.html
Cataloguer's Toolbox:                     http://www.mun.ca/library/cat
_______________________________________________________________________
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Subject: Re: Cataloging the Web
Date: Mon, 28 Apr 1997 09:44:35 -0700
From: "Eyler Coates, Sr." eyler.coates@worldnet.att.net
Organization: http://www.webspawner.com/users/EylerCoates/
Newsgroups: bit.listserv.autocat,schl.sig.lmnet
Robert Cunnew wrote:
>
> In article <336372F2.7388@worldnet.att.net>, "Eyler Coates, Sr."
> eyler.coates@worldnet.att.net writes
>
> >(4) SUBJECTS.  This would require the existence of a standard list of
> >subject headings, probably made available by the Search Engine (see
> >below), from which Webmasters could select (with helps) up to five
> >appropriate headings for their own page and put them in a tag:
> >       
> >Rather than use something as complicated as the Library of Congress
> >Subject Headings, something less detailed such as the Sears Subject
> >Headings, would probably be sufficient for the WWW.
>
> I'm not familiar with Sears, but isn't it - like LC - precoordinate?
> Please don't let's suggest that the Web is cluttered up with nineteenth
> century notions of subject access designed for catalogue cards.  Simple
> postcoordinate terms are what is required, eg term 1, Libraries, term 2,
> United States, *not* "Libraries - United States" or whatever Sears has
> to offer.  Search engines may not be perfect but they *can* do Boolean,
> even if it's often implicit rather than explicit.
These are excellent points.  I received an email response that suggested
the possibility of keywords as an alternative to either Sears or LCSH.
The matter of subject headings or keywords seems to be a crucial part of
the system.  If this general scheme were adopted, it seems that the
"keyword" option might be the most desirable.  It may be that a Search
Engine could provide a standard list of Keywords created from those
actually used by Webmasters, and that this list could serve as a
reference list to maintain a level of uniformity.  Webmasters could then
create new Keywords if there were none on the list adequate for their
needs.  Thus, the Keyword List would be constantly brought up to date.
Such a list would also be useful for users and researchers while browsing.  In
addition, a human being (cataloger) could monitor the list, creating
appropriate "See Also" references as part of the Search Engine's
offerings and perhaps even making redundant Keywords, created by
Webmasters (who are necessarily amateur catalogers), all refer to the
same items.  In this way, the equivalent of "See" references would be
created by the Webmasters, but their combination into ONE real reference
would be supervised and maintained by the professional.  Interestingly
enough, unlike the standard card catalog "See" references, these
"See"-like references would all be considered equally authentic.  This,
then, would be a dynamic system in a constant state of evolution.
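Purely as an illustration (none of these names or terms come from the
thread), a minimal Python sketch of how such a living Keyword List might
be maintained: Webmasters register the terms they actually use, and the
monitoring cataloger later records "See Also" links or marks a redundant
term as equivalent to a preferred one:

    # The Keyword List grows out of the terms Webmasters actually use.
    keyword_list = {}    # term -> number of pages using it
    see_also = {}        # term -> set of related terms (cataloger-supplied)
    equivalents = {}     # redundant term -> preferred term (cataloger-supplied)

    def register_keyword(term):
        """Called whenever a Webmaster tags a page with a keyword."""
        keyword_list[term] = keyword_list.get(term, 0) + 1

    def merge(redundant, preferred):
        """Cataloger folds a redundant, Webmaster-coined term into another."""
        equivalents[redundant] = preferred

    def relate(a, b):
        """Cataloger records a "See Also" link in both directions."""
        see_also.setdefault(a, set()).add(b)
        see_also.setdefault(b, set()).add(a)

    register_keyword("bicycles")
    register_keyword("bikes")
    merge("bikes", "bicycles")
    relate("bicycles", "cycling")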
> Given the undesirability of precoordination in subject indexing, I
> wonder whether there is a need for (5) Forms, taken from a short list of
> appropriate terms, eg information service, promotional material, images,
> sounds, software (form not subject, ie downloadable), news, directory,
> discussion forum ...
I regrettably must confess that I am unfamiliar with your frame of
reference here and am uncertain of your meaning.  I had suggested five
subject headings as a maximum, though if the Keyword option were
selected, it might seem that more than five would be required.  Are we
talking about the same thing?
Eyler Coates
=============================================================
All of the postings to this thread are available in a redacted
form, without repetitions and irrelevant matter, at:
                     Cataloging the Web
                Making the WWW More Accessible
   http://www.geocities.com/Athens/Forum/1683/cwindex.htm
==============================================================
Subject: Re: Cataloging the Web
Date: Mon, 28 Apr 1997 06:13:36 -0700
From: "Eyler Coates, Sr." eyler.coates@worldnet.att.net
Organization: http://www.webspawner.com/users/EylerCoates/
To: mac@slc.bc.ca
References: <199704280427.VAA24101@bmd2.baremetal.com> 
J. McRee Elrod wrote:
>
> >"If the producers of every site cataloged it themselves, then Yahoo!
> >wouldn't have a hard time keeping up with them. Of course, everyone would
> >have to agree on standard ways to do this, and if everyone agreed,
> >for-profit search sites like Yahoo! probably wouldn't be necessary."
>
> Mr. Coates elaboration of this proposal seems excellent to me.  I have
> only one addendum, and one disagreement.
>
> >(3) NAMES.
>
> Perhaps this should be qualified as personal and corporate names, to
> clarify that both are included, and to exclude "names" like names of
> chemicals, plants, and such.
This is a very good point and should be included in instructions for
Webmasters.
> >(4) SUBJECTS.
> ...
> >Rather than use something as complicated as the Library of Congress
> >Subject Headings, something less detailed such as the Sears Subject
> >Headings, would probably be sufficient for the WWW.
>
> The subjects of web sites are even more up-to-date and various than
> those of books.  My experience of Sears is that it is too limited even
> for books.  I would suggest LCSH supplemented by the most recent annual
> cumulation of the appropriate Wilson periodical index.
>
> Another option, of course, would be keywords.
Another posting to the Newsgroup also relates to this problem, suggesting
a rejection of both the Sears and LCSH in favor of search terms that
might permit a Boolean search, which I assume would mean the use of
keywords.  This is a crucial part of the system, and the point that you
make about the subjects of websites being "more up-to-date and various
than those of books" is a salient one.  If this general scheme were
adopted, it seems that the "keyword" option might be the most desirable.
 It may be that a Search Engine could provide a standard list of Keywords
created from those actually used by Webmasters, and that this list could
serve as a reference list to maintain a level of uniformity.  Webmasters
could then create new Keywords if there were none on the list adequate
for their needs.  Thus, the Keyword List would be in a constant state of
evolution.  Such a list would also be useful for users and researchers while
browsing.  In addition, a human being (cataloger) could monitor the list,
creating appropriate "See Also" references as part of the Search Engine
offering and perhaps even making redundant Keywords, created by
Webmasters (who are necessarily amateur catalogers), all refer to the
same items.  In this way, the equivalent of "See" references would be
created by the Webmasters, but their combination into ONE real reference
would be supervised and maintained by the professional.  This, then,
would be a dynamic system in a constant state of evolution.
Eyler Coates
--
============================================================
        Thomas Jefferson on Politics & Government
        http://pages.prodigy.com/jefferson_quotes
  Eyler Robert Coates, Sr.  eyler.coates@worldnet.att.net
============================================================
Subject: Re: Cataloging the Web
Date: Mon, 28 Apr 97 02:08:32 +0000
From: mac@slc.bc.ca (J. McRee Elrod)
Reply-To: mac@slc.bc.ca
Organization: Special Libraries Cataloguing, Inc.
To: eyler.coates@worldnet.att.net
CC: autocat@ubvm.cc.buffalo.edu
References: <199704280427.VAA24101@bmd2.baremetal.com>
>"If the producers of every site cataloged it themselves, then Yahoo!
>wouldn't have a hard time keeping up with them. Of course, everyone would
>have to agree on standard ways to do this, and if everyone agreed,
>for-profit search sites like Yahoo! probably wouldn't be necessary."
Mr. Coates elaboration of this proposal seems excellent to me.  I have
only one addendum, and one disagreement.
>(3) NAMES.
Perhaps this should be qualified as personal and corporate names, to
clarify that both are included, and to exclude "names" like names of
chemicals, plants, and such.
>(4) SUBJECTS.
...
>Rather than use something as complicated as the Library of Congress
>Subject Headings, something less detailed such as the Sears Subject
>Headings, would probably be sufficient for the WWW.
The subjects of web sites are even more up-to-date and various than
those of books.  My experience of Sears is that it is too limited even
for books.  I would suggest LCSH supplemented by the most recent annual
cumulation of the appropriate Wilson periodical index.
Another option, of course, would be keywords.
Mac
   __       __   J. McRee (Mac) Elrod (mac@slc.bc.ca)
  {__  |   /     Special Libraries Cataloguing   HTTP://www.slc.bc.ca/
  ___} |__ \__________________________________________________________
Date:         Mon, 28 Apr 1997 08:37:16 -0700
Reply-To:     Steve Shadle shadle@u.washington.edu
Sender:       "AUTOCAT: Library cataloging and authorities discussion group"
              AUTOCAT@LISTSERV.ACSU.BUFFALO.EDU
From:         Steve Shadle shadle@u.washington.edu
Subject:      Re: Cataloging the Web
I agree with much of what is presented in the article summary, but I do
have a couple comments that I would like to hear other people's thoughts
on.
> different.  For example, classification systems (such as Dewey, LC) are
> irrelevant because they are designed for grouping physical objects on
> shelves for browsing, access and retrieval.  Unnecessary elements only
My understanding is that the use of classification *solely* for grouping
physical objects is a North American (or at least non-European) practice,
that European libraries more frequently used classed catalogs, and that
it is not uncommon to assign multiple classifications to a work.
Finding works on related subjects is important, and unless a
subject authority system (whether a simple hierarchy like YAHOO or a more
complex structure like LCSH) is in place, classification can be used to
facilitate this type of access.
One of the specific points I would like feedback on is whether there are
institutions out there that feel the need to assign classification
*solely* for subject access (i.e., for resources that don't sit on a
shelf).  Do catalog users *use* classification as a subject retrieval
mechanism?
> of retrieval.  If users want more complete information, they can click on
> the document itself, unlike in a library where they would need to go up
> an elevator to the fourth floor to look at the document.  Therefore, Web
> Cataloging need only concern itself with retrieving a good list of mostly
> relevant documents that the user can then examine more closely.
I had this same thought, but I've had students who disagree with this
point.  When servers are down, when the Net is overloaded and one can't
connect to a resource for whatever reason, the catalog can serve as a much
quicker mechanism for *identifying* and citing resources.  It seems that
this generation of workstation users is impatient with even a 15-second
wait...getting an instant summary and brief description from a catalog
record may provide a better service to a large group of users.
IMHO, the use of user-supplied data would help immensely in bringing
organization to the vast majority of materials on the net. However, there
are some basic concepts in bibliographic description and identification
(e.g., name authority) that have the potential to be useful both in web
browsers and online catalogs.  Cutter's principles don't become irrelevant
in a networked world.
And can anyone out there tell me what the current status of the Dublin
Core (and other metadata schemes) is in terms of development,
establishment, and actual acceptance in the community?
Thanks to Eyler Coates, Sr. for posting the summary and to you for reading
this rather random set of thoughts. I look forward to discussion and
especially to specific examples.  --Steve
    Steve Shadle           shadle@u.washington.edu                * * * *
    Serials Cataloger                                              * * *
    University of Washington Libraries, Box 352900                  * *
    Seattle, WA 98195                               (206) 543-4890   *
Subject: Re: Cataloging the Web
Date: Sun, 27 Apr 1997 20:57:32 +0100
From: Robert Cunnew 
Organization: N/A
Newsgroups: bit.listserv.autocat,schl.sig.lmnet
References: <336372F2.7388@worldnet.att.net>
In article <336372F2.7388@worldnet.att.net>, "Eyler Coates, Sr."
eyler.coates@worldnet.att.net writes
>(4) SUBJECTS.  This would require the existence of a standard list of
>subject headings, probably made available by the Search Engine (see
>below), from which Webmasters could select (with helps) up to five
>appropriate headings for their own page and put them in a tag:
>       
>Rather than use something as complicated as the Library of Congress
>Subject Headings, something less detailed such as the Sears Subject
>Headings, would probably be sufficient for the WWW.
I'm not familiar with Sears, but isn't it - like LC - precoordinate?
Please don't let's suggest that the Web is cluttered up with nineteenth
century notions of subject access designed for catalogue cards.  Simple
postcoordinate terms are what is required, eg term 1, Libraries, term 2,
United States, *not* "Libraries - United States" or whatever Sears has
to offer.  Search engines may not be perfect but they *can* do Boolean,
even if it's often implicit rather than explicit.
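To make the distinction concrete, a small invented illustration in
Python: with postcoordinate terms the engine simply intersects the
documents indexed under each separate term, whereas a precoordinated
heading only matches if that exact combined string was assigned:

    # Postcoordinate: each document carries simple, separate terms.
    index = {
        "Libraries":     {"doc1", "doc2"},
        "United States": {"doc2", "doc3"},
    }

    def boolean_and(term1, term2):
        """Postcoordinate search: term1 AND term2."""
        return index.get(term1, set()) & index.get(term2, set())

    print(boolean_and("Libraries", "United States"))   # {'doc2'}

    # Precoordinate: the combined heading must have been assigned as one
    # string, e.g. "Libraries - United States", or the search finds nothing.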
Given the undesirability of precoordination in subject indexing, I
wonder whether there is a need for (5) Forms, taken from a short list of
appropriate terms, eg information service, promotional material, images,
sounds, software (form not subject, ie downloadable), news, directory,
discussion forum ...
--
Robert Cunnew
Librarian, Chartered Insurance Institute, London
Subject: Cataloging the Web
Date: Sun, 27 Apr 1997 08:38:26 -0700
From: "Eyler Coates, Sr." eyler.coates@worldnet.att.net
Organization: http://www.webspawner.com/users/EylerCoates/
Newsgroups: bit.listserv.autocat,schl.sig.lmnet
The current issue of Slate (http://www.slate.com) contains an article by
Bill Barnes in his Webhead column, "Search Me," on the inadequacy of
access to "the vast resources" of the Web as compared to any library.  He
examines the various search engines available, and details how
unsatisfactory they are.  He didn't really propose a solution to the
problem, but his conclusion included the following:
"If the producers of every site cataloged it themselves, then Yahoo!
wouldn't have a hard time keeping up with them. Of course, everyone would
have to agree on standard ways to do this, and if everyone agreed,
for-profit search sites like Yahoo! probably wouldn't be necessary."
A lot of people have discussed this problem, but no one that I have
discovered has proposed a simple, comprehensive, workable solution as
yet.  I posted to Slate's "The Fray" an outline of the following
possible overall structure that might meet the problem.  It is presented
in more detail here in hopes that other people might have some input, and
we can collectively arrive at a mechanism for providing better access to
the WWW.
CATALOGING THE WEB
A practical system must take into account the nature of materials on the
Web, the people who create them, the search engines that find them, and
the needs of the people who research them.  A system similar to that provided
by library services will not work, because so many of the elements are
different.  For example, classification systems (such as Dewey, LC) are
irrelevant because they are designed for grouping physical objects on
shelves for browsing, access and retrieval.  Unnecessary elements only
add unnecessary complication.  Web document research doesn't need
complete bibliographic data.  What's really needed is an efficient means
of retrieval.  If users want more complete information, they can click on
the document itself, unlike in a library where they would need to go up
an elevator to the fourth floor to look at the document.  Therefore, Web
Cataloging need only concern itself with retrieving a good list of mostly
relevant documents that the user can then examine more closely.
Another factor is the level of expertise of the people that will
necessarily be doing the cataloging.  Already, Web resources are vast,
constantly changing, and only promise to be more so in the future.  It is
necessary, therefore, that the cataloging be simple enough to be done by
an ordinary Webmaster and not require the services of a professional.
The basic requirements of people who search the Web for materials are (1)
Titles, (2) Names, (3) Subjects, and (4) Brief Abstracts accompanying the
results of searches for the first three elements.  Other elements, such
as dates, publishers (Webmasters?), editions, etc., could be obtained by
clicking on the document itself.  It is interesting that computers give
incredible forms of access to data, but to date, the best that Search
Engines can do has been to index every word that appears on a webpage.
This, however, produces very unsatisfactory search results most of the
time.  If searches could be conducted via the three elements above, with
an abstract included with each finding, the results would be far more
satisfactory for users.
What is needed, therefore, is a set of four page attributes that could be
included in each document's head.  Two of these are already found on most
webpages.
(1) TITLES.  Already included in the header of every Webpage:
        <TITLE>The title of the page</TITLE>
(2) DESCRIPTION (Abstract).  This is a standard META tag, using the
designation:
        <META NAME="description" CONTENT="A brief abstract of the page">
(3) NAMES.  This would require a new META tag, which could include up to
five names, selected appropriately by the Webmaster, of authors, editors,
corporations, subjects of biographies, etc., using the designation:
        
Distinguishing between authors, editors, etc., would be unnecessary,
because the abstract should make clear the relationship of the various
names to the page content.  Also, the document is right there and can be
clicked on if the user wants more precise information.
(4) SUBJECTS.  This would require the existence of a standard list of
subject headings, probably made available by the Search Engine (see
below), from which Webmasters could select (with helps) up to five
appropriate headings for their own page and put them in a tag:
        
Rather than use something as complicated as the Library of Congress
Subject Headings, something less detailed, such as the Sears Subject
Headings, would probably be sufficient for the WWW.  The only problem is that
such a list is not presently available on the Web, although it probably
would not be too difficult to establish a similar list of subject
headings that Webmasters could use.  If some public institution supplied
such a list on the WWW, all search engines could use it, and everybody
would benefit from the uniformity.  Simplicity combined with lots of help
would be necessary, because it would require application by Webmasters
without professional training.
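Purely as a sketch of what a page head carrying these four attributes
might look like, and of how easily an engine could read them, here is a
short Python example.  The tag names "names" and "subjects" are
placeholders assumed for the proposed new META tags; the thread does not
fix their exact spelling, and the sample content is invented:

    from html.parser import HTMLParser

    SAMPLE_HEAD = """
    <head>
    <title>An Example Page</title>
    <meta name="description" content="A brief abstract of the page.">
    <meta name="names" content="Jefferson, Thomas; Coates, Eyler">
    <meta name="subjects" content="Political science; United States">
    </head>
    """

    class HeadReader(HTMLParser):
        """Collects only the title and META attributes an engine would index."""
        def __init__(self):
            super().__init__()
            self.fields = {}
            self._in_title = False
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self._in_title = True
            elif tag == "meta" and "name" in attrs:
                self.fields[attrs["name"]] = attrs.get("content", "")
        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False
        def handle_data(self, data):
            if self._in_title:
                self.fields["title"] = data.strip()

    reader = HeadReader()
    reader.feed(SAMPLE_HEAD)
    print(reader.fields)   # title, description, names, subjects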
Note that the Search Engine itself would contain the equivalent of "See"
and "See Also" references.  "See" references would automatically bring up
the referenced materials in most cases, whereas "See Also" references
should present options to the user.  In some cases, a search could bring
up a request for more specific attributes from the user.  All of these
would be elements built into the search engine's program.
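A minimal invented sketch of those three behaviors at query time: a "See"
reference is followed automatically, "See Also" references come back as
options, and an over-broad search triggers a request for more specific
attributes (the terms, pages, and limit used here are made up):

    see = {"cars": "automobiles"}                           # followed automatically
    see_also = {"automobiles": ["trucks", "motorcycles"]}   # offered as options
    index = {"automobiles": ["page%d.html" % i for i in range(1, 60)]}

    def run_query(term, limit=50):
        term = see.get(term, term)          # "See": silently redirect the search
        hits = index.get(term, [])
        if len(hits) > limit:
            return {"message": "Too many results; please add a name or subject.",
                    "see_also": see_also.get(term, [])}
        return {"results": hits, "see_also": see_also.get(term, [])}

    print(run_query("cars")["message"])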
The first requirement to initiate the system would be for some
enterprising Search Engine to offer searches based on the META tags
described above.  This need not replace the present word searches, and it
would permit the system to be introduced gradually.
Then what would be required is the cooperation of Webmasters in providing
the necessary META tags.  Would they do it?  Of course they would do it!
 Right now, many of them try all kinds of tricks in order to get their
page recognized, such as filling the page with certain key words.  The
one thing that Webmasters want after going to all the trouble of creating
a Webpage is for everyone to have access to it.  Surely, if they were
required to put in the appropriate META tags to get listed properly, you
can bet they would do it.
Search engines that rely on just the META tags will be easier to set up.
 Since a search engine would be concerned only with the data in a page's
heading, this would require less storage and might even enable a search
engine to visit every page on the Web, and do it more frequently.
Eyler Coates
--
============================================================
        Thomas Jefferson on Politics & Government
        http://pages.prodigy.com/jefferson_quotes
  Eyler Robert Coates, Sr.  eyler.coates@worldnet.att.net
============================================================