How To Use General Design Issues and Metadata
In Order To Get Your Web Page Picked Up By Search Engines

Werner Schweibenz

Department of Information Science, University of Saarland, Germany
w.schweibenz@rz.uni-sb.de
Paper presented at the Second International Workshop
Exploring a Communication Model for Web Design, Seattle, WA, July 10-17, 1999

Outline

  1. Introduction
  2. A Survey of Search Services on the World Wide Web
  3. How Big is The Web and How Much of it is Covered By Search Services?
  4. How Do Search Engines Work?
  5. General Design Issues for Web Site Promotion
  6. How to Use Metadata for Resource Description on the Web
  7. Conclusions
    Reference

Abstract

The paper gives an introduction to information retrieval on the World Wide Web. It presents a survey of Web search services and discusses how big the Web is, how large the indexes of search engines are, and how search engines work. Ten recommendations for proactive Web design show how general design issues can be used to get one’s Web page picked up by search engines. The use of Web metadata, the Dublin Core Metadata Element Set, and metadata templates is described.

1 Introduction

„Currently, search is simply bad“, admits Joel Truher, vice president of technology for the search engine HotBot, in an interview with Online, the leading magazine for information professionals (Sherman 1999, 54). Truher describes working with a Web search engine as follows: „It’s like interacting with a snotty French waiter. The service is bad, you get served things you didn’t ask for, you often have to order again and again, and you can’t get things that are listed on the menu.“ (Sherman 1999, 54-55)

The problems of information retrieval on the World Wide Web are caused partly by the size of the Web and its lack of structure, which make searching a difficult and complicated task (Dong/Su 1997, 67-68), and partly by the way Web search tools work. Web searching is keyword searching on some kind of index or directory of textual documents (Sherman 1999, 55). When using Web search tools one faces the problem that the „search tools retrieve too many documents, of which only a small fraction are relevant to the user query. Furthermore, the most relevant documents do not necessarily appear at the top of the query output order“, as Gudivda et al. (1997, 58) emphasize. Although little is known about the retrieval models of commercial search services (Gudivda et al. 1997, 58), there are certain measures Web designers can take to improve the retrieval of a Web page.

This paper examines how general Web design issues and metadata can be used to create a Web page that is likely to be picked up and promoted by search engines. These questions of search engine persuasion and Web site promotion are of special interest to Web designers because even a well-designed and information-rich Web site is of little use if people who search the Web for information do not find it. This is not as unlikely as it may sound, because the coverage of the Web by search engines is comparatively low. Accordingly, in order to get their Web sites noticed by search engines, Web designers have to apply the means of Web site promotion described in this paper during the design process. Before focusing on these methods of proactive Web design, the paper gives a survey of search services on the Web, explains how they work, and reports how big the Web is estimated to be and how much of it is covered by search services. This provides the reader with the background information necessary for understanding proactive Web design and metadata issues.

2 A Survey of Search Services on the World Wide Web

A variety of search services support Web users in finding information on the World Wide Web. These search services can be put into seven categories: subject lists, Web directories, WebRings, clearing houses, search engines, hybrid search engines, and meta search engines. Subject lists, Web directories, WebRings, and clearing houses are set up and maintained manually by human beings who serve as indexers and catalogers for these resource collections. The resources of search engines, whether individual, hybrid, or meta search engines, are gathered automatically by software programs that search the Web and collect data according to certain algorithms.

From the user perspective, there is a difference in how these two kinds of search services are used. According to Stross (1996, 221), there are two general techniques for searching information on the Web: resource categorization by hierarchical or free text indexes like Yahoo and active resource discovery by search engines like Lycos. In the first case, users take a more passive role and follow the given structures of highly organized information in subject lists, Web directories, WebRings, and clearing houses. In the second case, they have to express their information needs actively by selecting keywords, formulating a query and using logical operators in order to query any kind of search engine.

From the technical perspective, there are differences in the organization of information that lead to a similar categorization. Schwartz (1998, 974) distinguishes between two basic types, classified lists and query-based engines. A similar taxonomy is used by Gudivda et al. (1997, 62-64). Classified lists are the browseable search services: subject lists, Web directories, WebRings, and clearing houses. Query-based engines are all kinds of search engines: individual search engines, hybrid search engines, and meta search engines. In practice the lines of distinction between these two kinds of search services are fading as more and more Web search engines add browseable subject tree structures and classified lists use search engines for resource discovery (Tunender/Ervin 1998, 174 and 179). Nevertheless, this aspect has to be discussed in greater detail in order to understand the way manually maintained search services and search engines work.

Classified lists are maintained by human indexers and catalogers and therefore these tools can cover only fractions of the Web because it is impossible for human beings to select and index all Web resources (Gudivda et al. 1997, 59-60). Instead of quantity, these services offer quality, i.e. a comparatively small amount of resources organized in a structured form which is easy to use. The users do not have to bother with learning a query language or logical operators like Boolean or proximity operators. They only have to browse through the information structure of these services to find hand-picked information which is usually of high quality as it is reviewed by catalogers or other experts who maintain these services. The drawbacks are that the sizes of these search services are relatively small and that they do not offer the latest information because the providers cannot keep pace with the rapid growth of the Web.

The most focused kind of classified list is the subject list, which concentrates on one topic. According to Bell (1997, 18), a subject list is the tool of choice „if the user is new to the Web, because the information has been filtered, analyzed, and organized by human intervention. The user can see all the choices, providing a sense of the variety and type of information available, but the choices are finite and structured.“ Examples are the subcategories of the Virtual Library, e.g. the Virtual Library Linguistics (Internet, URL = http://www.emich.edu/~linguist/www-vl.html) or the Virtual Library Applied Linguistics (Internet, URL = http://alt.venus.co.uk/VL/AppLingBBK/welcome.html).

Web directories are a more diverse collection of information; they „offer collections of links to Web pages that are rigorously organized into hierarchical subject trees, covering a gamut of subject areas“ (Bell 1997, 18). The strength of Web directories lies primarily in the subject hierarchy. The advantages are the high relevance of the listed items and the ease of retrieval and navigation. Users do not need to know all the synonyms of a search term and do not have to enter complex search statements; instead they are guided through the search process by following hierarchical trees to a focused area where they can make serendipitous discoveries (Callery/Tracy-Proulx 1997, 58-59). In general, Web directories are maintained by professional catalogers who ensure a substantial quality of information (Callery/Tracy-Proulx 1997, 63). An example is Yahoo (Internet, URL = http://dir.yahoo.com/), one of the biggest and most popular Web directories. Yahoo offers information on technical communication, e.g. the section Technical Writing (Internet, URL = http://dir.yahoo.com/Social_Science/Communications/Writing/Technical_Writing/).

A comparatively new search service that offers very easy navigation for inexperienced users is the WebRing. It was started in April 1996 as a tool to bring together related Web pages under the auspices of a Ring master who administers the WebRing. The participating Web pages are linked together and lined up like pearls on a necklace. The users just have to follow the links and in the end return to the first page without getting lost in a hierarchical structure. An example is the Technical communication WebRing (Internet, URL = http://flywrite.net/thewriteplace/tcomm/). A problem with WebRings is the quality of information, because anybody can set up a WebRing. Therefore the quality standards of some WebRings are rather low.

In contrast to the other classified lists, clearing houses deal with very specialized information and are frequently organized like libraries. They often use subject headings and classification systems to sort the information they offer. An example is the Argus Clearinghouse founded by the University of Michigan (Internet, URL = http://www.clearinghouse.net/ and http://argus-inc.com/).

These classified lists are all managed by owners who decide what to add to their lists. Therefore no recommendations can be made on how to promote a Web page for classified lists except to offer high quality content and to register the Web page with these services so that the owners or catalogers can examine it. Although these services are very diverse, they generally appreciate good content because they think of themselves as providers of high quality information.

Query-based engines include all kinds of search engines: individual search engines, hybrid search engines, and meta search engines. In contrast to manually maintained search services, these automatically operated search tools try to cover as much of the Web as possible, index a huge amount of Web pages and collect the information in databases. Although there are various names for individual search engines, e.g. robots, wanderers, gatherers, harvesters or spiders, they all work basically in the same way. According to the WWW Robot FAQ a search engine can be defined as „a program that automatically traverses the Web’s hypertext structure by retrieving a document and recursively retrieving all documents that are referenced“ (Maze/Moxley/Smith 1997, 14).

Maze, Moxley and Smith (1997, 14-39) describe a model search tool as it is used by typical search engines. A typical search engine consists of five components: the discovery robot, the harvesting robot, the indexing robot, the database, and the user interface. The discovery robot traverses the Web and does the resource discovery by following and registering hyperlinks. It is followed by the harvesting robot, which extracts parts or all of the text of the Web pages discovered by the first robot and stores it in a database where the indexing robot begins its job. The indexing robot sorts the content of the Web page into different indexing fields, e.g. Uniform Resource Locator, title, and headings, which are stored in a database. The users interact with the search engine through the interface where they enter the query and get the results from the database.
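
The discovery step of this model can be illustrated with a minimal Python sketch. It is not taken from Maze, Moxley and Smith: the dictionary TOY_WEB, the example.org addresses and the function name discover are hypothetical, and an in-memory link table stands in for real HTTP requests.

```python
from collections import deque

# Hypothetical toy Web: each URL maps to the URLs its page links to.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/", "http://example.org/c"],
    "http://example.org/c": [],
    "http://example.org/orphan": [],  # no other page links to this one
}

def discover(start_url, web):
    """Follow hyperlinks breadth-first and register every URL that is reached."""
    seen = {start_url}
    queue = deque([start_url])
    registered = []
    while queue:
        url = queue.popleft()
        registered.append(url)
        for link in web.get(url, []):
            if link not in seen:  # register each document only once
                seen.add(link)
                queue.append(link)
    return registered

found = discover("http://example.org/", TOY_WEB)
print(found)
```

In this toy example the page that no other page links to is never registered, because a robot that only follows hyperlinks has no way of reaching it.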

Although all search engines basically work as described in this model, there are big differences in how they search, harvest, and index the Web. Moreover, the query languages of the search engines are very different from each other. Most of the search engines use Boolean operators, some use proximity operators, and a few use concept searches based on linguistic or artificial intelligence methods. This makes Web searching quite complicated for users who are not information specialists (for detailed information on how to use search engines see Hock (1999) and the Infobasic Search Engine Information Page, Internet URL = http://www.infobasic.com/searchengine.html). In order to make searching easier, most search engines have combined the query option with a Web directory option. This combination creates hybrid search engines which offer both a classified list and a query-based engine. So the lines of distinction between these two kinds of search services are fading.

As there are big differences in the way search engines work, the most frequent piece of search advice is to always search more than one engine to answer a question (Garman 1999, WWW). Using different search engines for the same query is certainly a good tip, but it can be a troublesome and time-consuming task because no two search engines work alike. Therefore meta search engines, also known as megasearch engines, parallel search engines or multiple search engines, were developed in order to make searching easier. They „use one form that simultaneously sends a single query to a number of search engines and then presents the results. The advantage of using these tools is the parallel processing of the search, with each search engine running at the same time. In addition, the better ones present a variety of options for sorting the results and duplicate removal“ (Notess 1998, 2). According to the research of Sander-Beuermann and Schomburg (1998, WWW), „meta-searchengines will deliver 2 to 5 times more results than the best single searchengine“ while the rates of duplicates sorted out range from 10 to 30 percent. The drawback is that meta search engines, although they are complex tools, often cannot use the specific search functions that constitute the strength of individual search engines. Notess (1998) discusses the pros and cons of using a single search engine and a meta search engine in greater detail. Garman (1999) gives a survey of 19 meta search engines and some recommendations on how to use them effectively.
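
The principle of a meta search engine, one query sent to several engines in parallel followed by duplicate removal, can be sketched in a few lines of Python. This is only an illustration: the functions engine_a and engine_b are hypothetical stand-ins that return fixed result lists, whereas a real meta search engine would query the individual services over the network and would have to cope with their different query languages.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines: each takes a query
# and returns a ranked list of URLs.
def engine_a(query):
    return ["http://example.org/1", "http://example.org/2"]

def engine_b(query):
    return ["http://example.org/2", "http://example.org/3"]

def meta_search(query, engines):
    """Send one query to all engines in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:  # duplicate removal
                seen.add(url)
                merged.append(url)
    duplicates = sum(len(results) for results in result_lists) - len(merged)
    return merged, duplicates

merged, duplicates = meta_search("technical writing", [engine_a, engine_b])
print(merged, duplicates)
```

The sketch keeps the order in which the engines report their results; sorting the merged list by some relevance measure would be one of the additional options that the better tools offer.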

There is no best search service for the Web. The selection of the appropriate tool depends on the user’s information need. Bell (1997) suggests a classification of information needs and the tools that should be used for these needs. In general, classified lists are better for broad questions, while specific questions should be submitted to query-based engines. But no matter what kind of search service one uses, one should always be aware that each of these tools covers only a smaller or larger part of the Web, never the Web as a whole.

3 How Big is The Web and How Much of it is Covered By Search Services?

The question of how big the Web is has always been of interest to Web users. In fact, some of the first search engines were developed to measure the size of the Web and its growth (Sonnenreich/Macinta 1998a, 3). Sonnenreich and Macinta also offer a Web page with further information on the history of searching on the Web (A History of Search Engines, Internet, URL = http://www.wiley.com/compbooks/sonnenreich/history.html).

The Web is growing exponentially and search engines try to keep pace with its growth. Search engine companies often claimed to keep up with the growth of the Web and used to brag that they had the biggest databases, covering almost the entire Web. So it was quite bad publicity for the search engine companies when the Princeton Report was published in spring 1998 (for details on the impact of the report see Sullivan 1999a). This report was based on a study by Lawrence and Giles of the NEC Research Institute at Princeton. It investigated how much of the Web is covered by search engines. The findings were that the percentage of the indexable Web indexed by major search engines is lower than commonly believed (Lawrence/Giles 1998a, 100). An experiment with over 575 queries performed in December 1997 on the six most popular search engines showed that the individual search engines cover only fractions of the Web: HotBot, 34%; AltaVista, 28%; Northern Light, 20%; Excite, 14%; Infoseek, 10%; and Lycos, 3% (Lawrence/Giles 1998a, 100). The size of the indexable Web was estimated from the overlap between the two largest engines and was put at 320 million pages in December 1997. The Princeton Report received wide publicity and was followed by a heated discussion and by great efforts of search engine companies to improve the Web coverage of their search tools (cf. Sullivan 1999a).
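
The idea of estimating the size of the Web from the overlap between two engines can be made concrete with a small Python sketch. It is a simplified illustration and not the procedure of Lawrence and Giles: it assumes that the two engines index pages independently of each other, and the page counts are hypothetical figures chosen for the example.

```python
def estimate_web_size(size_a, size_b, overlap):
    """Estimate the total number of pages from two indexes and their overlap.

    If engines A and B index pages independently, the share of B's pages
    that A also holds (overlap / size_b) approximates the share of the
    whole Web that A covers, so total = size_a / (overlap / size_b).
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive")
    return size_a * size_b / overlap

# Hypothetical figures in million pages, not data from the studies cited.
total = estimate_web_size(size_a=120, size_b=90, overlap=36)
print(total)
```

With these invented figures engine A holds 40% of engine B's pages, so its 120 million pages are taken to be 40% of the Web, which gives an estimate of 300 million pages. If the engines favor the same popular pages, the overlap is larger than under independence and the estimate is too low.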

Around the same time, Bharat and Broder, two research specialists at the Digital Systems Research Center (where the search engine AltaVista is maintained), developed a more standardized, statistical way of measuring search engine coverage and overlap. They ran two sets of experiments involving over 10,000 queries each in June/July 1997 and November 1997. From these experiments they estimated that the size of the static, public Web as of November 1997 was about 200 million pages (Bharat/Broder 1998a, 380). In mid-1997 the four major search engines covered the following fractions of the Web: HotBot, 47%; AltaVista, 39%; Excite, 32%; and Infoseek, 18%. In November 1997 the estimated coverage was: HotBot, 48% (77 million pages); AltaVista, 62% (100 million pages); Excite, 20% (32 million pages); and Infoseek, 17% (17 million pages). The joint total coverage was 160 million pages. The overlap of all four search engines was surprisingly small: less than 1.4% of the total coverage, or about 2.2 million pages.

Estimated Web size in million pages, total coverage in million pages, and coverage by search engines in percent

Report                             Web size   Total coverage   HotBot   AltaVista   Northern Light   Excite   Infoseek   Lycos
Princeton Report, December 1997    320        -                34%      28%         20%              14%      10%        3%
Bharat/Broder, November 1997       200        160              48%      62%         -                20%      17%        -

Table 1 Size and Coverage of the indexable Web in late 1997,
according to Bharat/Broder 1998a and Lawrence/Giles 1998a

Both Lawrence and Giles and Bharat and Broder did follow-ups of their studies, Lawrence and Giles in September 1998 (Lawrence/Giles 1998b, WWW) and Bharat and Broder in March 1998 (Bharat/Broder 1998b, WWW). The findings of Bharat and Broder were that as of March 1998 the Web had an estimated size of 275 million pages and that the estimated coverage was: HotBot, 36%; AltaVista, 40%; Excite and Infoseek 12% each.

Estimated Web size in million pages and coverage by search engines in percent and in million pages

Report                          Web size   Total coverage   HotBot      AltaVista   Excite     Infoseek
Bharat/Broder, November 1997    200        160              48% / 77    62% / 100   20% / 32   17% / 17
Bharat/Broder, March 1998       275        -                36% / 100   40% / 110   12% / 33   12% / 33

Table 2 Size and Coverage of the indexable Web in November 1997 and March 1998,
according to Bharat/Broder 1998a and Bharat/Broder 1998b

Although the ranking of the search engines in Table 2 varied, the overall picture did not change in favor of the search engines as far as the coverage of the Web was concerned. In fact, the search engines covered an even smaller share of the Web than a few months before, which is due to the rapid growth of the Web. According to Bharat and Broder (1998b, WWW) the Web grew from 125 million pages in mid-97 to about 200 million in November 1997 and about 275 million pages as of March 1998. This means that the Web doubled in size in less than nine months and was growing by about 20 million pages per month in 1997 and 1998. These results indicate that none of the search engines comes even close to covering the Web or keeping pace with its growth.
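
The growth figures can be checked with a short calculation in Python. The sketch assumes that „mid-97“ means July 1997, so that eight months separate the first and the last estimate, and that growth was steadily exponential; both are assumptions made for this example and not statements by Bharat and Broder.

```python
import math

pages_mid_1997 = 125    # million pages, mid-1997
pages_march_1998 = 275  # million pages, March 1998
months = 8              # assumption: July 1997 to March 1998

# Average absolute growth per month over the whole period.
growth_per_month = (pages_march_1998 - pages_mid_1997) / months

# Doubling time under the assumption of steady exponential growth.
monthly_rate = math.log(pages_march_1998 / pages_mid_1997) / months
doubling_months = math.log(2) / monthly_rate

print(round(growth_per_month, 2), round(doubling_months, 1))
```

Under these assumptions the Web grows by just under 19 million pages per month and doubles in about seven months, which is consistent with the rounded figures given above.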

Considering how big the Web is and how many Web pages exist, it is of interest for Web designers to know how search engines work and how they can be persuaded to visit a Web page.

4 How Do Search Engines Work?

In section 2, the components of a typical search engine were described. Now we have to take a closer look at how a search engine works. As already stated, a typical search engine consists of five components: the discovery robot, the harvesting robot, the indexing robot, the database, and the user interface. The most interesting components for our purpose are the harvesting and the indexing robot because they determine what the results look like when the user enters a query and searches the database. We therefore focus on these two components.

The technical process of harvesting and indexing is described in detail by Maze, Moxley, and Smith (1997, 21-29) and can be summed up as follows. The discovery robot follows the hypertext links of Web pages and in this way discovers new documents. It saves the addresses of these documents in a Uniform Resource Locator database which is used by the harvesting robot for retrieving the Web page content. The harvesting robot revisits the Web pages and extracts parts of the text or the whole text of each Web page. It breaks down the harvested text into component words and saves them in a database that contains the words gathered from all collected Web pages. The indexing robot creates an index for the database. It extracts words from the Uniform Resource Locator, the title and the whole text or at least parts of the text of the Web page and creates the index of the database against which the user queries will be matched. During the indexing process frequently used words are omitted (they are called stop words, e.g. a, and, the, of). The indexing process stores not only the words but also their positions within the pages. Information about the position of words, how often they are used and the context in which they are used allows the indexing robot to apply statistical methods to weight and rank the indexed terms and later the search results for user queries. Words in the Uniform Resource Locator, the title or headings, for example, are weighted and ranked higher than words in the text. Both harvesting and indexing are done by special algorithms, sets of rules and procedures that determine how the search engine does its job. Each search engine uses a special set of algorithms and treats them as a trade secret because good algorithms give it a lead over its competitors.
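
The indexing step described above can be sketched in Python as follows. The sketch is a deliberately simple illustration and does not reproduce the algorithm of any real search engine, since these are trade secrets: the field weights in FIELD_WEIGHTS, the sample pages in PAGES and the function names are hypothetical, and only the four stop words mentioned above are used.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "and", "the", "of"}             # the examples given above
FIELD_WEIGHTS = {"url": 3, "title": 3, "text": 1}  # hypothetical weights

def tokenize(text):
    """Break text down into lower-case component words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the pages, fields and positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for url, fields in pages.items():
        all_fields = {"url": url, **fields}
        for field, content in all_fields.items():
            for position, word in enumerate(tokenize(content)):
                if word not in STOP_WORDS:  # stop words are omitted
                    index[word][url].append((field, position))
    return index

def search(index, query):
    """Rank pages by the summed field weights of the matching words."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for url, postings in index.get(word, {}).items():
            scores[url] += sum(FIELD_WEIGHTS[field] for field, _ in postings)
    return sorted(scores, key=lambda url: (-scores[url], url))

PAGES = {  # hypothetical harvested pages
    "http://example.org/metadata.html": {
        "title": "Metadata for the Web",
        "text": "How to use metadata and the Dublin Core.",
    },
    "http://example.org/design.html": {
        "title": "Web design",
        "text": "Design issues; metadata is covered elsewhere.",
    },
}

index = build_index(PAGES)
print(search(index, "metadata"))
```

The first page is ranked higher for the query „metadata“ because the word occurs in its Uniform Resource Locator and title as well as in its text, whereas the second page contains it only in the text.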

In order to distinguish themselves from their competitors most search engines offer special features and search options. These special features depend, similar to the word indexing functions, on algorithms that sort the content of Web pages into different indexing fields that are called meta words. Meta words describe certain parts of a Web page. Common meta words are, for instance, the title, the Uniform Resource Locator, and the text of links and images (Sonnenreich/Macinta 1998a, 30, 37, 49, 51). Sonnenreich and Macinta (1998a, Chapter 2) give a detailed description of the meta words the major search engines use. As the special features are subject to frequent changes, it is necessary to regularly check new developments at Search Engine Watch, an independent company dedicated to search engine research. It offers the latest news in a regularly updated Search Engine Features Chart for Webmasters (Internet: URL = http://searchenginewatch.internet.com/webmasters/features.html).

Apart from meta words, the major search engines use so-called HTML META tags, among them the META tags description and keywords, fields which describe the content of the Web page and offer terms for indexing by search engines. META tags are inserted into the HTML source code of the Web page by the author or a publisher in order to describe the page content and make it more accessible for search engines by offering information about the Web page in special fields. We will deal with META tags later in greater detail.

Besides words and meta words, which are sorted according to these algorithms, the database contains an automatically generated summary of each Web page the harvesting robot has gathered. Generally, the summary consists of the Uniform Resource Locator, the content of the HTML TITLE tag and the first few lines of the Web page’s text or some sentences that are extracted from the text by automatic abstracting according to statistical methods (Maze/Moxley/Smith 1997, 26). Some search engines use the META tag description to create a summary of the Web page (Sonnenreich/Macinta 1998a, 23 and 31). This information is presented on the results page of the search engine interface after the terms of the user’s query have been matched against the terms in the database. That means that a user only searches the index of the database and not the entire Web or parts of it. It is important to keep this in mind because there are restrictions on what information the harvesting robot brings back: for example, there might be access restrictions, technical problems, and dead links. Thus parts of the Web are not harvested and indexed although there are numerous robots searching the Web. In 1999, there are several hundred search engines active on the Web (cf. Koster’s Database of Web Robots, Overview (1999), Internet, URL = http://info.webcrawler.com/mak/projects/robots/active/html/index.html and Yahoo Search Engines, Internet, URL = http://dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Search_Engines/ for active search engines).
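
How such a summary might be assembled is shown in the following Python sketch, which uses the HTMLParser class of the Python standard library. It is an assumption-laden illustration, not the method of a particular search engine: the class PageSummaryParser, the function summarize, the cut-off of 60 characters and the sample page are all hypothetical, and the fallback simply takes the beginning of the text instead of applying statistical abstracting.

```python
from html.parser import HTMLParser

class PageSummaryParser(HTMLParser):
    """Collect the TITLE, the META description and the visible body text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and data.strip():
            self.text_parts.append(data.strip())

def summarize(url, html, max_chars=60):
    """Build a results-page entry from URL, title and a short abstract."""
    parser = PageSummaryParser()
    parser.feed(html)
    body_text = " ".join(parser.text_parts)
    # Prefer the author's description; fall back to the start of the text.
    abstract = parser.description or body_text[:max_chars]
    return {"url": url, "title": parser.title.strip(), "abstract": abstract}

SAMPLE = """<html><head><title>Proactive Web Design</title>
<meta name="description" content="Ten recommendations for Web site promotion.">
</head><body><h1>Proactive Web Design</h1><p>Search engines index text.</p></body></html>"""

summary = summarize("http://example.org/design.html", SAMPLE)
print(summary)
```

In the sketch a page without a META tag description is summarized by the first characters of its body text, which shows why the opening lines of a page matter for the way it is presented on a results page.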

5 General Design Issues for Web Site Promotion

Although all search engines basically work according to the same principles, there are big differences in how the individual search engines harvest, index and rank Web pages. The problem for users of search engines is that only „few details concerning system architecture, retrieval models, and query-execution strategies are available for commercial search tools. The cause of preserving proprietary information has promulgated the view that developing Web search tools is esoteric rather than rational“, as Gudivda et al. (1997, 58) state. This point of view is shared by Wheatley and Armstrong (1997, 206), who complain about the lack of documentation on how Web pages are indexed for subsequent searching. They examined 18 search engine studies and found that „information on how Web sites are selected for inclusion in these vast catalogues is scarce; some spiders or crawlers simply follow ‘trails’ of URL links and it is clear that no selection criteria are applied“ (Wheatley/Armstrong 1997, 206). The online help pages of most search engines, which are often the only source of information available to users, do not give much information on the way the search engine works. They mostly offer tips on how to formulate the query (cf. Sherman 1998, who compares the help documentation of the most popular „Big Seven“ search tools AltaVista, Excite, HotBot, Infoseek, Lycos, Northern Light and Yahoo).

This lack of information is due to the fact that the algorithms of search engines are company secrets, as Bharat and Broder (1998a, 380) admit. Due to intense competition among commercial search services, the companies do not specify how their robots harvest (Tunender/Ervin 1998, 178). Therefore it is difficult to discover the optimum design for the promotion of a Web page, as Tunender and Ervin (1998, 178) rightly state. Only general issues for Web site promotion can be pointed out and general recommendations can be made.

Before focusing on these recommendations, it is necessary to warn of a phenomenon that is called Search Engine Persuasion (Laursen 1998, 43). Search Engine Persuasion consists of various activities that are aimed at achieving high rankings in the search engine results pages. High rankings are desirable especially for companies that advertise products on the Web and want to achieve maximum exposure to potential clients. Most of these Search Engine Persuasion activities are underhand techniques; some are devious, not to say illegal. They range from attempts to deceive the ranking algorithms of search engines to the illegal use of trademarks and company names. Laursen (1998) and Stanley (1997) give a survey of the most common techniques of Search Engine Persuasion. The best-known technique is keyword spamming or spamdexing, i.e. one or more keywords are repeated over and over again in order to achieve a high ranking by search engines. Another technique is to place hidden text on the Web page, for instance by marking the text as a comment with a comment tag or by using the same color for text and background, e.g. white text against a white background. In both cases the text is not visible on screen but is part of the source code of the Web page which is analyzed by search engines. A third technique is the use of keywords which bear no relation to the subject of the Web document. Using false but popular keywords which are often searched for, e.g. sex or Playboy, is an effective way of boosting the hit rate of a Web page. When these Search Engine Persuasion activities became widespread, most search engines began to take active measures against them (Laursen 1998, 44; Pringle/Allison/Dowe 1998, 376) and to ignore or downgrade Web pages which use those techniques (cf. section Spam in Search Engine Watch’s Search Engine Features for Webmasters). Therefore these activities no longer work with most search engines.
Moreover, Search Engine Persuasion is unethical and throws an unfavorable light on the author of a Web page and its content. Nevertheless these activities are quite popular and are often used by dubious Web consulting companies that offer to promote their customers‘ Web sites by secret techniques.
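
How a search engine might recognize keyword spamming can be illustrated with a simple Python sketch. The measures actually taken by search engines are not documented, so everything here is an assumption made for illustration: the function keyword_share, the cut-off SPAM_THRESHOLD and the two sample texts are hypothetical.

```python
import re
from collections import Counter

def keyword_share(text):
    """Return the most frequent word and its share of all words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return None, 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)

SPAM_THRESHOLD = 0.5  # hypothetical cut-off, chosen only for this example

honest = "Ten recommendations for proactive Web design"
spammed = "cheap cheap cheap cheap cheap cheap flights"

for text in (honest, spammed):
    word, share = keyword_share(text)
    print(word, round(share, 2), share > SPAM_THRESHOLD)
```

In the second sample a single word accounts for six of seven words and would be flagged, while no word dominates the first sample. A check of this kind does not detect hidden text or misleading keywords, which would require comparing the visible page with its source code and its subject.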

For Web authors there are other ways to promote their Web pages. It is important to include these measures at an early stage of the design process and to keep retrieval techniques in mind while designing the Web pages (Tunender/Ervin 1998, 178). Therefore these measures are called proactive Web design (Andrews/Schweibenz 1997). Proactive Web design includes all measures that can be taken to increase the retrievability of Web pages either before or after publishing the page on the Web. The following recommendations for proactive Web design are based on the research of Maze, Moxley, and Smith (1997), Tunender and Ervin (1998), Sonnenreich and Macinta (1998a), Laursen (1998), Pringle, Allison, and Dowe (1998), and Sullivan’s service Search Engine Watch (1999b). There is no guarantee that these recommendations will boost a Web page into the Top Ten ranks of search engines but, as retrieval tests show, they make it very likely that a Web page is picked up and favorably ranked by search engines.

There are ten recommendations for promoting your Web page:

  1. Submit your Web page to search services
  2. Link your Web page with related Web pages
  3. Provide indexable information for search engines
  4. Use a flat structure for your Web site
  5. Use a meaningful Uniform Resource Locator
  6. Use an informative title for your Web page
  7. Repeat important title words, especially in headings and the first sections of the page
  8. Do not use tricks in order to try to achieve a higher ranking
  9. Use Metadata
  10. Test the retrievability of your Web page

The recommendations will now be described in detail.

1 Submit your Web page to search services

If you consider the size and the growth of the Web, it is obvious that you cannot wait for your Web page to be picked up by chance. It is important to let the search services know that your page is there. So the most important step is to register your page with all search services you consider important, whether they are search engines or classified lists. Almost every search service offers an option to submit a Web page, mostly by means of a submission form. The links that lead to these forms are usually placed on the front page of the search service and are often named „Add a URL“ or „Suggest your site“. With search engines you mostly have to enter your email address and the Uniform Resource Locator of the page you want to submit, whereas with classified lists you often have to give a short description of the page content and name a category where it fits in. If you do not want to do the submission yourself, you can use commercial submission services that register your page automatically with dozens or hundreds of search engines.

It is important to keep in mind that registering your Web site with a search engine is no guarantee that all of your Web pages will be completely indexed by the search engine. Because of the huge size of the Web a lot of search engines try to offer a representative selection from the Web and therefore index Web sites only partially (Brake 1997, WWW).

It is also important to re-submit your Web page after each change (Laursen 1998, 44) because changes may influence the indexing and ranking by search engines.

2 Link your Web page with related Web pages

Search engines as well as human searchers traverse the Web by following hyperlinks. Therefore linking your Web pages with those of other Web authors makes it more likely that your pages are found both by search services and Web users. So ask the owners of related pages to link to your page and offer a link in return. In this way your page becomes part of a network of links which makes it more likely that it is retrieved.

Apart from being found, being linked with other pages has another advantage. According to Sherman (1999, 57), a link to another Web page is almost like a citation in a book. It shows that the Web author considers this page important enough to link to it. Search engines plan to make use of this kind of peer review. Future search engine technology will try to locate authoritative sources on the Web, measure their link importance and use it in relevance ranking (Sherman 1999, 57). A similar way of rating the importance of Web pages is often already in use in the form of link popularity, i.e. the number of links that refer to a certain page (cf. section Ranking/Link Popularity in Search Engine Watch’s Search Engine Features for Webmasters).
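
Link popularity as described above can be illustrated with a minimal Python sketch. The link graph and the function name link_popularity are hypothetical and chosen for illustration only; how an actual search engine weights inbound links is not documented in the sources cited here.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "b.html"],
}

def link_popularity(link_graph):
    """Count how many distinct pages link to each target page."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):      # count each source page once per target
            if target != source:         # ignore links a page makes to itself
                counts[target] += 1
    return counts

popularity = link_popularity(links)
for page, inbound in popularity.most_common():
    print(page, inbound)
```

In this sketch the page that no other page links to (d.html) ends up with a count of zero, which mirrors the advice to ask owners of related pages for links.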

3 Provide indexable information for search engines

The report of Lawrence and Giles refers to the indexable Web and the report of Bharat and Broder to the static public Web. The terms indexable and static indicate that there must be parts of the Web that are static and public and can be indexed while there are others that are not static and not public and cannot be indexed by search engines. Public pages of the Internet are all Web pages that can be freely accessed while non-public pages are pages with restricted access, e.g. intranets of companies and organizations. Static and indexable pages are all Web pages that exist as HTML files and can be displayed as a page by a browser. According to Nielsen (1996b, WWW), the fundamental design idea of the Web is based on the page as the atomic unit of information. The concept of static pages which are linked with other pages was the original design idea of the Web and contributed to its ease of use. For search engines, the static page that exists as an HTML file and is linked with other HTML files by hyperlinks was the basic architectural concept of the Web, which made it easy to access, harvest, and index information. But static files are only one form of presenting information on the Web. There are also frames and dynamic pages.

Frames are a technology that divides a Web page into different parts, usually a bar at the top of the page, a bar on the left side of the page, and the remaining part of the page. The top bar often describes the content of the page, e.g. a company’s name or a product name, and the left bar generally contains navigation elements (for Western cultures). These bars are constantly displayed on the screen and serve as a frame for the information that is displayed in the remaining part(s) of the page. There are different points of view on frames from the user’s perspective. A lot of Web designers like frames because they give a page a professional look, can be used for corporate design, and make navigation easy. But frames cause trouble as far as printing, bookmarking and search engines are concerned. Many browsers cannot print framed pages appropriately, as Nielsen points out (1996b, WWW). Instead of the full page they print a single frame. A similar problem exists with bookmarks. The content of a framed page changes according to the links you follow but as far as the browser is concerned, you are still sitting on the original page’s Uniform Resource Locator (URL) (Sonnenreich/Macinta 1998a, 64). Therefore it is not possible to bookmark a framed page in the usual way, because no URL exists that describes the particular combination of frames the browser shows at a given moment (Sonnenreich/Macinta 1998a, 64). In order to bookmark a framed page correctly, the browser would have to store a description of the contents of each frame window along with the URL and in this way bookmark a Web page that never existed anywhere. Search engines face a similar problem with frames: „A search engine would have to store large amounts of extra data just to handle any of one set of frames.“ (Sonnenreich/Macinta 1998a, 64) Therefore, most search engines follow the links within framed pages, but ignore the frames themselves (cf. Sullivan 1999a). They save each link directly but do not harvest the frame set because this would be too much information to deal with. For the same reason a lot of search engines refuse to crawl framed pages (cf. section Crawling/Frames Support in Search Engine Watch’s Search Engine Features for Webmasters). So if you use frames, make sure to provide information on your framed pages in metadata or provide static pages that contain NOFRAMES information (cf. Search Engine Design Tips in Search Engine Watch).

Dynamic pages are virtual spaces that do not exist as HTML files. According to Stross (1996, 229-30) these virtual spaces include image maps, query-based forms as front-ends to databases and dynamic Web pages generated on the fly by server-side applications. Because these virtual spaces do not exist as static HTML files, search engines are confused by them and ignore them (Stross 1996, 228-29; Sullivan 1999a, 35; cf. the section Crawling/Image Maps in Search Engine Watch’s Search Engine Features for Webmasters). Therefore static pages and metadata should provide detailed, indexable information about the content of those virtual spaces (cf. Search Engine Design Tips in Search Engine Watch).

Last but not least, when talking about indexable Web information the multimedia content of the Web has to be dealt with. The Web allows the use of different media, e.g. text, images, audio, and video. Although the help documentation of several search engines states that the search engines index multimedia content, this is not quite correct. What the search engines actually index is a text tag that accompanies the multimedia file. This text tag contains a textual description of the file and was originally intended for browsers that could not display certain kinds of files. The HTML tag for an image shows how the text tag is used. The image tag consists of two parts, the image part, which contains the URL or the file name of the image to be displayed, and the alternative textual description, e.g. <IMG SRC="URL or file name of the image file" ALT="textual description of the image">. As search engines cannot analyze multimedia content, a textual description or metadata is necessary for indexing by search engines.
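
The following Python sketch illustrates what a text-based indexer can read from image tags. It uses the standard library class html.parser.HTMLParser; the class name AltTextCollector and the sample file names are hypothetical, and the sketch does not reproduce the behavior of any particular search engine.

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect the ALT descriptions of IMG tags, the only part of an
    image that a text-based indexer can read."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []
        self.images_without_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":                 # the parser lowercases tag names
            return
        alt = dict(attrs).get("alt")     # attribute names are lowercased too
        if alt:
            self.alt_texts.append(alt)
        else:
            self.images_without_alt += 1

# Hypothetical page fragment with one described and one undescribed image.
sample = (
    '<IMG SRC="logo.gif" ALT="Logo of the Department of Information Science">'
    '<IMG SRC="chart.gif">'
)
collector = AltTextCollector()
collector.feed(sample)
print(collector.alt_texts)
print(collector.images_without_alt)
```

The second image contributes nothing that could be indexed, which is why every image should carry an ALT description.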

These examples show that it is necessary to provide indexable information for search engines if you use frames, dynamic Web pages and multimedia content.

4 Use a flat structure for your Web site

Most search engines use a strategy called the breadth-first approach. According to Stross (1996, 224) this strategy works as follows. The search engine starts from a designated node and works outwards. Each time it reaches a new node, it checks the node for Uniform Resource Locators and adds them to the top of a list of nodes to visit. After the current node is finished, it takes a new node off the bottom of the list and starts a new search process. In this way, the top levels of Web sites are visited and a wide range of nodes is covered before the search engine goes deeper. Stross (1996, 224) calls this strategy the most efficient one for a single robot because it attempts to cover the broadest possible number of servers on the Web while causing minimum disruption to each Web server. While crawling the Web, some search engines only visit the top levels of Web sites, while others perform the so-called Deep Crawl (cf. the section Crawling in Search Engine Watch’s Search Engine Features for Webmasters). Deep Crawl means that the search engine automatically registers pages from a Web site that were not explicitly submitted to it. But according to the findings of Tunender and Ervin (1998, 177) it takes a considerable amount of time before search engines crawl deeper into the hierarchical structure of a Web site. Their experiment lasted for 46 days and only two of the five search tools they investigated went deeper than the first level (Excite and Infoseek).
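
The breadth-first strategy described above can be sketched in a few lines of Python. The site structure, the function name breadth_first_crawl and the parameter max_depth are assumptions made for illustration; a real robot would fetch pages over the network and extract the links from their HTML code.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
site = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/products/b"],
    "/about": ["/about/team"],
    "/products/a": ["/products/a/specs"],
    "/products/b": [],
    "/about/team": [],
    "/products/a/specs": [],
}

def breadth_first_crawl(link_graph, start, max_depth=None):
    """Visit pages level by level; return (page, depth) in visiting order."""
    queue = deque([(start, 0)])          # new nodes enter at one end ...
    seen = {start}
    visited = []
    while queue:
        page, depth = queue.popleft()    # ... and are taken off the other end
        visited.append((page, depth))
        if max_depth is not None and depth >= max_depth:
            continue                     # do not follow links below the limit
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

shallow = breadth_first_crawl(site, "/", max_depth=1)
full = breadth_first_crawl(site, "/")
print([page for page, _ in shallow])
print([page for page, _ in full])
```

With max_depth=1 the sketch behaves like a search engine that visits only the top levels of a site: pages two or more links away from the start page are never reached.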

These findings suggest that it is better to use a flat structure for a Web site instead of a deep structure. It seems more likely that Web pages within a flat structure are found earlier and more easily than pages within a deep structure.

5 Use a meaningful Uniform Resource Locator

Often the address of a Web page, the Uniform Resource Locator (URL), is a complex string of characters and numbers. Although this machine-level addressing system should never have been shown in the user interface, according to Nielsen (1996a, WWW), users try to decode it. Therefore it should contain meaningful directory and file names that describe the information it contains. Apart from the user’s point of view it is also important to use meaningful URLs from the search engine’s perspective. The research of Maze, Moxley and Smith (1997, 24) indicates that the words in the URL are indexed by search engines, and the findings of Pringle, Allison, and Dowe (1998, 371) suggest that the number of times the keyword occurs in the URL has some influence on the ranking by search engines. Therefore it is important to use a meaningful URL for a Web page.
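
As a rough illustration of why a meaningful URL matters, the following Python sketch splits a URL into words and counts how often a keyword occurs among them. Both URLs and both function names are hypothetical, and the simple tokenization is an assumption; the cited studies do not describe how search engines actually split URLs into index terms.

```python
import re
from collections import Counter

def url_words(url):
    """Split a URL into lowercase tokens made of letters and digits."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", url)]

def keyword_count_in_url(url, keyword):
    """Count how often a keyword occurs as a token in the URL."""
    return Counter(url_words(url))[keyword.lower()]

# Both URLs are hypothetical examples.
opaque = "http://www.example.org/~ws/doc17/p3.htm"
meaningful = "http://www.example.org/metadata/dublin-core-metadata.html"
print(keyword_count_in_url(opaque, "metadata"))
print(keyword_count_in_url(meaningful, "metadata"))
```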

6 Use an informative title for your Web page

Words within the <TITLE> tag of the HTML document form the title of a Web page that is visible in the top line of the Web browser. Moreover they appear as the name of the electronic bookmark of the page. Therefore the title is an important means of orientation for users and should be informative and represent the intention and the content of your page. Apart from the user’s perspective, the title of a Web page is also important for search engines. A lot of search engines display the title in the results pages they present to the user (Maze/Moxley/Smith 1997, 26). Apart from that, the title is important for the retrieval by search engines, as the research of Tunender and Ervin (1998, 178) shows. All search engines in their study (AltaVista, Excite, Infoseek, and Lycos) successfully harvested the title tag. This indicates that Web designers can increase the retrieval of Web pages by better utilizing the title. These findings are supported by Pringle, Allison, and Dowe (1998, 371) who found that the number of times the keyword occurs in the document title is important for the ranking of Web pages. Additionally, Maze, Moxley, and Smith (1997, 26) point out that certain search engines work with semantic indexing techniques for weighing and ranking and compare words from the title with words in the text of a Web page in order to compute the one or two statistically most important sentences in that page. These sentences are then used as a summary in the results page. Therefore the title of your Web page should be informative and give a good description of the content of the page.

7 Repeat important title words, especially in headings and the first sections of the page

Writing for the Web is in some respects different from writing for publication in print. This is due to the electronic environment in which the Web page is presented. If you keep retrieval techniques in mind while writing for the Web, you should consider using important terms in full more often than you would in normal writing, rather than using pronouns and other indirect ways of referring to subjects, because for statistical analysis it matters how often a word occurs in the document (Pringle/Allison/Dowe 1998, 371 and 376). The statistical analysis is important because of the way search engines weigh terms in documents. Search engines do not only look for words but also for the context in which the words are used (Maze/Moxley/Smith 1997, 26). Therefore it is important to create a context for weighing and ranking of terms because weighted word indexing seems to be more prevalent than thought, as Tunender and Ervin (1998, 178) state.

Besides being used more often, important words should also appear in the headings of the page. According to Pringle, Allison, and Dowe (1998, 371) search engines do not only measure how often keywords occur in the whole document including title and metadata, they also examine the number of times keywords occur in the first heading tag. Therefore important words should be repeated in the headings.

Additionally, important words should be used in the first section of a Web page because some search engines extract the first 50 words of a Web page and use them as a summary (Maze/Moxley/Smith 1997, 26). Therefore it is a good idea to put an abstract or a summary in the first section of a Web page. According to the research of Wheatley and Armstrong (1997, 212), Internet search tools make good use of the title and first paragraph for creating a summary.
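
To check what such a summary of a page would look like, the extraction of the first 50 words can be imitated with a small Python sketch. The function names and the crude tag removal are assumptions for illustration; the sources do not describe how individual search engines strip markup or handle page titles when building the summary.

```python
import re

def strip_tags(html):
    """Very rough tag removal; real pages need a proper HTML parser."""
    return re.sub(r"<[^>]+>", " ", html)

def first_words_summary(page_text, max_words=50):
    """Return the first max_words words of the visible page text."""
    words = page_text.split()
    summary = " ".join(words[:max_words])
    if len(words) > max_words:
        summary += " ..."                # mark that the text was cut off
    return summary

# Hypothetical page: a heading followed by 60 filler words.
page = "<H1>Proactive Web design</H1><P>" + "word " * 60 + "</P>"
summary = first_words_summary(strip_tags(page))
print(summary)
```

Running the sketch on your own page shows at a glance whether the opening section reads like an abstract or merely like navigation text.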

8 Do not use tricks in order to try to achieve a higher ranking

It is not a good idea to use means of search engine persuasion in order to achieve a higher ranking. These tricks do not work very well because most search engines use countermeasures (Pringle/Allison/Dowe 1998, 376) and some even penalize pages with spamming or other tricks (Laursen 1998, 44). Moreover your Web page does not appear in a favorable light if you use tricks like search engine persuasion.

9 Use Metadata

Metadata is data about data or objects. A good example of metadata is the catalog cards in a library catalog. On the Web, metadata are used in the form of special META tags in HTML, where the Web author can, for example, give a description of the page with the META tag description or name keywords for indexing by search engines with the META tag keywords. We will deal with metadata in greater detail in the next section.

A lot of search engines use metadata to extract the summary of the Web page that is shown in the results list and use the keywords for indexing (Laursen 1998, 44; Sonnenreich/Macinta 1998a, 23 and 31). Most search engines harvest and index metadata (cf. section Crawling/Meta robots tag in Search Engine Watch’s Search Engine Features for Webmasters). Some give a boost in ranking to Web pages that contain metadata (cf. section Ranking in Search Engine Watch’s Search Engine Features for Webmasters). Others ignore metadata completely or partly, e.g. Excite refuses to use the META tag description because of spamming (Sonnenreich/Macinta 1998a, 39) and Web Crawler ignores the META tag keywords (Sonnenreich/Macinta 1998a, 51). But no matter what kind of metadata policy search engines use, metadata are an important device to enhance the accessibility of your Web page’s content for search engines because you can use them to describe multimedia content as well as dynamic, framed, and static Web pages.

10 Test the retrievability of your Web page

Most search engines offer the possibility to check if your Web site is registered with them. Usually you enter the Uniform Resource Locator of your Web site in a form that checks it against the index of the database. As usual, each search engine offers this service in a different way (cf. Checking Your URL in Search Engine Watch).

Apart from search engines there are commercial services that offer to check search engine databases for URLs, keywords and titles. These URL Checking Services usually offer not only the checking of URLs but also the submission of your Web site to dozens or hundreds of search engines. One of these services is „Rank This!“ (Internet, URL = http://www.rankthis.com/). It allows you to search eight different search tools, among them the search engines AltaVista, Excite, HotBot, Infoseek and the Web directory Yahoo, for your Web page by using keywords or the title of the page. Moreover it shows how your page is ranked for the keywords or title in relation to other Web pages, e.g. among the Top Ten or among the first hundred hits. This free service allows you to adapt your proactive Web design measures if necessary. Another free service is NorthernWeb’s tool Meta Medic, which checks only the META tags description and keywords of a Web page and evaluates the use of metadata (Internet, URL = http://www.northernwebs.com/set/setsimjr.html).

A final comment on the recommendations for proactive Web design

As already stated, these recommendations cannot guarantee that your Web pages will be placed among the Top Ten results of a search engine because there is too little known about the way individual search engines work. Nevertheless the research of Laursen (1998), Pringle, Allison, and Dowe (1998) as well as Tunender and Ervin (1998) shows that the recommendations of proactive Web design work pretty well in improving the retrievability of Web pages. In order to get your Web page picked up by search engines, it is important to take retrieval techniques into account while you design your Web site, as Tunender and Ervin (1998, 178) rightly state. As a Web designer you should never forget that information design is more than what is seen in the Web browser.

6 How to Use Metadata for Resource Description on the Web

The importance of metadata was already emphasized in the recommendations for proactive Web design. Now we want to take a closer look at how Web metadata are used. As the application of metadata for the Web is a complex matter, we cannot go into all the details but only give an overview of the most important features of HTML metadata and the Dublin Core Element Set. For further information see Miller (1996), Rusch-Feja (1998), Weibel (1995; 1999) and Weibel/Hakala (1998).

Metadata is data about data or objects. A good example for metadata is a library catalog because it contains library data that describe the media stored in the library following certain cataloging rules. Metadata is a form of „electronic cataloging rules“ (Rusch-Feja 1998, WWW) and can be used to describe digitized and non-digitized resources, e.g. electronic documents, books, or objects. On the Internet metadata refers to any data that helps to identify, describe and retrieve electronic resources of any kind (text, video, audio, etc.). On the Web the use of metadata is necessary because although all the information on the Web is machine-readable, not all information is machine-understandable, as Lassila (1998, 30) puts it. The problem is that the Web is largely unstructured and chaotic and that the Hypertext Markup Language HTML which is used for the design of Web pages offers only few mechanisms to store information about data (Sherman 1999, 57). Search engines can read the HTML code of a Web page, but they cannot understand natural language and cannot extract specific information from documents, such as author or topic. Therefore, search engines need metadata as „machine-understandable descriptions of Web resources“ (Lassila 1998, 31). Web metadata can be used for various purposes such as cataloging, software agents and resource discovery, content rating, electronic commerce, digital signatures, privacy, and describing intellectual property rights (cf. Lassila 1998, 31).

For our purpose we focus on resource discovery. According to Milstead and Feldman (1999, WWW) metadata is crucial to searching because retrieval on the Web is largely a matter of matching query words with words in indexes and metadata help to improve matching by offering a standardized structure and content for indexing information. Moreover metadata help to describe non-textual multimedia data that cannot be indexed by text-based search engines.

Although it offers great promise, the problem with metadata is that its application is often complicated. Miller (1996, WWW) and Weibel (1995, WWW) give several examples of complex and highly standardized metadata formats, e.g. Text Encoding Initiative or MARC (MAchine Readable Catalogue), that are used to describe electronic resources in different fields. Because of the size of the Web it is impossible to let expert catalogers or indexers create metadata for Web resources except for very important ones. Therefore both authors conclude that Web metadata must be of a form that both search engines and human beings can interpret and should be so easy to use that any Web author is able to describe the content of his or her page with it. For this reason HTML metadata was introduced in the HyperText Markup Language.

The use of metadata for the Web started with the HTML 2.0 specification, which included the HTML META tag. The META tag was defined in HTML 3.2 as a header element that consists of name/value pairs which can describe the properties of the document (Laursen 1998, 44), e.g. author, description, keywords (for the integration of Dublin Core Metadata in HTML 4.0 cf. Kunze 1999). META tags are located in the header of the Web page (between the tags <HEAD> and </HEAD>) and are not displayed by the browser but can be viewed in the source code of the page by selecting the option View/Source in the browsers Netscape Communicator and Microsoft Internet Explorer. As META tags are part of the HTML code, almost every search engine harvests and indexes HTML META tags (cf. section Crawling/Meta robots tag in Search Engine Watch’s Search Engine Features for Webmasters). But there are big differences in the way the search engines make use of metadata. Sonnenreich and Macinta (1998a, Chapter 2 Overview of the Major Search Engines) describe in detail how major search engines use the META tags keywords and description. For example, AltaVista, HotBot, and Infoseek index the Web page according to the terms given in the META tag keywords and use the META tag description to describe the content of the Web page in the summary of the results list, whereas Excite ignores the META tag description and Web Crawler ignores the META tag keywords. Additionally, some search engines use their own metadata syntax which corresponds with the sorting criteria of their index. Therefore Miller (1996, WWW) makes a distinction between what he calls the search engine approach to metadata and Dublin Core metadata. Following Miller we will distinguish between search engine metadata and Dublin Core metadata because there are some important differences between these two metadata formats. For search engine metadata there is no standard set of elements, and metadata is only fully used if it is in the syntax recommended for that particular search engine (Miller 1996, WWW), whereas Dublin Core metadata has a basic definition of 15 elements (Version 1.0) which has been stable since December 1996 and is now on its way to formalization and standardization (Weibel 1999, WWW). Both formats are used on the Web and are embedded in HTML code in the same way, as the following example shows for the basic HTML metadata elements.

 
<HTML>
<HEAD>
<TITLE> How To Use General Design Issues and Metadata In Order To Get Your Web Page Picked Up By Search Engines</TITLE>
<META NAME="author" CONTENT="Werner Schweibenz">
<META NAME="keywords" CONTENT="search engines, information retrieval on the Web, proactive Web design, Web site promotion, metadata, Dublin Core">
<META NAME="description" CONTENT="The paper gives a survey on Web search services, deals with size of the Web and its coverage by search engines, gives recommendations for proactive Web design and how to promote your Web page by using general design issues and metadata.">
<META NAME="robots" CONTENT="index,follow">
</HEAD>
<BODY>
… Text …
</BODY>
</HTML>

This example contains only the basic metadata elements that are harvested by most search engines. The obvious difference in the markup of search engine metadata and Dublin Core metadata is the abbreviation DC in front of the name element of the name/value pair. So in Dublin Core the META tag <META NAME="author" CONTENT="name"> would be <META NAME="DC.Author" CONTENT="name">. But this is not the only difference; the two formats also differ in the number and coverage of their elements. In order to show the differences and provide examples, both search engine and Dublin Core metadata for this paper will be created with the help of metadata templates.
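
How a program can read the name/value pairs from such a header and tell Dublin Core elements apart by their DC prefix is sketched below in Python, using the standard library class html.parser.HTMLParser. The class name MetaHarvester and the shortened sample header are hypothetical; the sketch shows the principle only and does not reproduce the harvesting rules of any search engine.

```python
from html.parser import HTMLParser

class MetaHarvester(HTMLParser):
    """Collect the TITLE text and the NAME/CONTENT pairs of META tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            pairs = dict(attrs)          # attribute names arrive lowercased
            name, content = pairs.get("name"), pairs.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Shortened, hypothetical header modeled on the example above.
header = """<HTML><HEAD>
<TITLE>How To Use General Design Issues and Metadata</TITLE>
<META NAME="author" CONTENT="Werner Schweibenz">
<META NAME="DC.Title" CONTENT="How To Use General Design Issues and Metadata">
<META NAME="robots" CONTENT="index,follow">
</HEAD><BODY>Text</BODY></HTML>"""

harvester = MetaHarvester()
harvester.feed(header)
# Dublin Core elements are recognized by the prefix of the NAME value.
dublin_core = {k: v for k, v in harvester.meta.items() if k.startswith("dc.")}
print(harvester.title.strip())
print(harvester.meta)
print(dublin_core)
```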

As already mentioned, search engine metadata is not standardized and its use varies from search engine to search engine. Therefore it is not possible to list all the different META tags search engines use. Instead we will focus on the most widespread search engine META tags as they are generated by popular metadata templates. Metadata templates are Web-based forms that automatically generate HTML META tags. There are numerous metadata templates on the Web; a lot of them are offered free of charge by search engine companies and Web consulting companies in order to attract customers. Most of these templates generate search engine metadata only. An example is the service provided by Websitepromote, a company that offers a Meta Tag Generator (Internet, URL = http://www.websitepromote.com/resources/meta/). The Meta Tag Generator offers the following boxes: Title of Web Page (64 characters maximum), Up to 7 Keywords (100 characters maximum, separated by commas or spaces in order of importance), Your email address, and Description (200 characters maximum). The META tags are created by clicking the button Create Meta and can be inserted into the HTML code of the Web page by cut & paste. But these are only very basic metadata.
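
What such a template does can be imitated with a short Python sketch. The length limits are the ones listed for the form above; the function name build_meta_tags, the error handling and the exact output format are assumptions, since the implementation of the Meta Tag Generator is not known, and the email box of the form is left out.

```python
from html import escape

# Field limits as listed for the form described above.
MAX_TITLE = 64
MAX_KEYWORDS = 7
MAX_KEYWORD_CHARS = 100
MAX_DESCRIPTION = 200

def build_meta_tags(title, keywords, description):
    """Return header lines with TITLE and the basic META tags."""
    if len(title) > MAX_TITLE:
        raise ValueError("title longer than %d characters" % MAX_TITLE)
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError("more than %d keywords" % MAX_KEYWORDS)
    keyword_text = ", ".join(keywords)   # keywords in order of importance
    if len(keyword_text) > MAX_KEYWORD_CHARS:
        raise ValueError("keywords longer than %d characters" % MAX_KEYWORD_CHARS)
    if len(description) > MAX_DESCRIPTION:
        raise ValueError("description longer than %d characters" % MAX_DESCRIPTION)
    # escape() protects quotes and angle brackets inside attribute values.
    return [
        "<TITLE>%s</TITLE>" % escape(title),
        '<META NAME="keywords" CONTENT="%s">' % escape(keyword_text),
        '<META NAME="description" CONTENT="%s">' % escape(description),
    ]

tags = build_meta_tags(
    "Proactive Web design",
    ["search engines", "metadata", "Dublin Core"],
    "Recommendations for promoting a Web page.",
)
print("\n".join(tags))
```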

An example for a more comprehensive service is Vancouver Webpages’ Meta Builder (Internet, URL = http://vancouver-webpages.com/META/mk-metas.html). It offers metadata templates both for search engine metadata and Dublin Core metadata (although in a „lite“ version only, for the individual elements selected from Dublin Core see http://vancouver-webpages.com/META/DC.lite.html). We want to take a closer look at the search engine metadata before we deal with the Dublin Core metadata.

The search engine metadata created with the Vancouver Webpages template is more comprehensive than that of Websitepromote. It contains 16 elements, some of them being a „lite“ version of Dublin Core. The metadata elements and how they are used by search engines are explained on the Web page About Meta Builder (Internet, URL = http://vancouver-webpages.com/META/about-mk-metas2.html). For our example we will list all 16 elements, even though they are not all used (comments are included in italics).

  1. Title
    How To Use General Design Issues and Metadata In Order To Get Your Web Page Picked Up By Search Engines
    (This box is for the title of the Web page as in the HTML title tag.)

  2. Description
    Paper presented at the Second International Workshop Exploring a Communication Model for Web Design, Seattle, WA, July 10-17, 1999. The paper gives a survey on Web search services, deals with size of the Web and its coverage by search engines, gives recommendations for proactive Web design and how to promote a Web page by using general design issues and metadata.
    (The box contains a summary of the page’s content. Most search engines use the description for display on the results list. Therefore the description should give a concise and informative report comparable to an abstract in journals or online databases.)

  3. Keywords
    search engines, information retrieval on the Web, proactive Web design, Web site promotion, Web metadata, Dublin Core, Workshop Exploring a Communication Model for Web Design
    (The box lists index terms for search engine indexing. Most search engines refer to keywords for indexing, therefore it is important to select adequate keywords.)

  4. Subject
    search engines, information retrieval on the Web, proactive Web design, Web site promotion, Web metadata, Dublin Core, Workshop Exploring a Communication Model for Web Design
    (The box contains a DC.lite element, the topic of the resource, or keywords or phrases that describe the subject or content of the resource. Subject is basically the same as keywords, the only difference is the name part of the name/value pair that is marked as a DC element.)

  5. Creator
    Werner Schweibenz
    (The box contains the name of the author who created the page.)

  6. Publisher
    Department of Information Science (Fachrichtung Informationswissenschaft), University of Saarland
    (The box names the publisher of the Web page, e.g. an organization or a company.)

  7. Contributors
    contains no data
    (The box can name a person or institution who made an important contribution to the Web page.)

  8. Coverage.PlaceName
    contains no data
    (The box describes the spatial coverage of the Web page’s content. As there is no direct relation to a certain place, the box contains no data.)

  9. Coverage.xyz:(degrees,minutes,seconds; decimals allowed)
    contains no data
    (The box describes the geographic location of the resource, if meaningful. x is Longitude, y is Latitude, in decimal degrees, z is elevation in meters. As there is no direct relation to a certain place, the box contains no data.)

  10. Owner
    w.schweibenz@rz.uni-sb.de
    (The box contains legacy value, comparable with a copyright notice. The email address of the owner is specified.)

  11. Expires
    contains no data
    (The box names the date after which a page is considered out of date.)

  12. Charset (Recommended if not ISO8859-1)
    contains no data
    (The box offers a selection menu for describing the content type of the Web page, e.g. Content-type: text/html; charset=iso-8859-5.)

  13. Language and Dialect (country)
    English – US
    (The box names the language the content of the Web page is written in. The first selection menu allows you to choose the language, the second menu allows you to specify the dialect, e.g. Canadian English or American English. The language is specified using a 3-character string from Z39.53. EN stands for English.)

  14. Robots: Controls Web robot traversal NOFOLLOW NOINDEX
    Empty – All
    (The box contains instructions for robots on how they should treat the content and the links of the Web page. The default value is empty and means ALL. The terms for robot control are ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW. ALL means that search engines shall index the page and follow the links, whereas NONE asks search engines to leave the page alone. INDEX means that robots are welcome to include this page in search services. FOLLOW means that robots are welcome to follow links from this page to find other pages.)

  15. Object Type
    Document
    (The box offers a selection menu for describing the type of the Web page, e.g. document, dictionary, or database. It puts the document in a particular category that can be searched for.)

  16. Rating
    General
    (The box offers a selection menu for describing the content of the Web page as General, Mature, Restricted, or 14 years. This scheme allows the rating of HTML and other documents for age-sensitive material and is part of PICS, the Platform for Internet Content Selection.)
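
The robot control terms from item 14 and the charset declaration from item 12 can be sketched as META elements. This is a minimal illustration, assuming a page that should stay out of the index while its links are still followed and whose text is encoded in ISO 8859-5. The values are hypothetical and are not output of the Meta Builder.

```html
<!-- Hypothetical values for illustration, not generated by the Meta Builder -->
<!-- NOINDEX: do not include this page in the index; FOLLOW: follow its links -->
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<!-- Declares the content type and character set, as in the example of item 12 -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-5">
```

Leaving the ROBOTS element out altogether corresponds to the default value ALL described in item 14.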

The metadata are generated by clicking the button Create HTML at the bottom of the page and the META tags created by the template are displayed on screen:

your META tags

The beginning of your HTML document (root page if using frames) should look like this (clip or save as TEXT):

<HTML LANG=en-US>
<HEAD PROFILE="http://purl.org/metadata/dublin_core">
<TITLE>How To Use General Design Issues and Metadata In Order To Get Your Web Page Picked Up By Search Engines</TITLE>
<LINK REV=made href="mailto:w.schweibenz@rz.uni-sb.de">
<META NAME="keywords" CONTENT="search engines, information retrieval on the Web, proactive Web design, Web site promotion, Web metadata, Dublin Core, Workshop Exploring a Communication Model for Web Design">
<META NAME="description" CONTENT="Paper presented at the Second International Workshop Exploring a Communication Model for Web Design, Seattle, WA, July 10-17, 1999. The paper gives a survey on Web search services, deals with size of the Web and its coverage by search engines, gives recommendations for proactive Web design and how to promote a Web page by using general design issues and metadata.">
<META NAME="rating" CONTENT="General">
<META NAME="VW96.objecttype" CONTENT="Document">
<LINK REL=SCHEMA.VW96 href="http://vancouver-webpages.com/META/VW96-schema.html">
<META NAME="ROBOTS" CONTENT="ALL">
<META NAME="DC.Title" CONTENT="How To Use General Design Issues and Metadata In Order To Get Your Web Page Picked Up By Search Engines">
<META NAME="DC.Creator" CONTENT="Werner Schweibenz">
<META NAME="DC.Subject" CONTENT="search engines, information retrieval on the Web, proactive Web design, Web site promotion, Web metadata, Dublin Core, Workshop Exploring a Communication Model for Web Design">
<META NAME="DC.Description" CONTENT="Paper presented at the Second International Workshop Exploring a Communication Model for Web Design, Seattle, WA, July 10-17, 1999. The paper gives a survey on Web search services, deals with size of the Web and its coverage by search engines, gives recommendations for proactive Web design and how to promote a Web page by using general design issues and metadata.">
<META NAME="DC.Publisher" CONTENT="Department of Information Science (Fachrichtung Informationswissenschaft), University of Saarland">
<META NAME="DC.Language" SCHEME="RFC1766" CONTENT="EN">
<!-- Metadata generated by http://vancouver-webpages.com/META/mk-metas.html -->
</HEAD>

Note: Certain HTML editors mangle META elements; you may wish to use a simple editor like Notepad or vi to add these.

You may generate the following HTTP headers using a CGI script or otherwise (e.g. Apache .meta file) in your server:

Content-language: en-US
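
Where neither a CGI script nor the server configuration is available, an HTTP-EQUIV META element is a common in-document substitute for such a header. The following line is a sketch of that alternative and is not part of the Meta Builder output; whether a given server or search engine honors it is not guaranteed.

```html
<!-- Assumed in-document equivalent of the HTTP header "Content-language: en-US" -->
<META HTTP-EQUIV="Content-Language" CONTENT="en-US">
```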

The metadata created with the Vancouver Webpages’ Meta Builder contain a mix of search engine META tags and Dublin Core META tags. But as it is only a „lite“ version with some selected elements of the Dublin Core, we now have to deal with the complete Dublin Core Element Set.

„The Dublin Core initiative is an international and interdisciplinary effort to define a core set of elements for resource discovery“, explains a definition given by Weibel and Hakala (1998, WWW) who offer an introduction to the history of the Dublin Core Element Set and the outcomes of the first five Dublin Core Workshops. According to Weibel (1995, WWW) the Dublin Core Element Set „was proposed as the minimum number of metadata elements required to facilitate the discovery of document-like objects in a networked environment such as the Internet“. The semantics of the element set was developed to meet the following principles, as Weibel explains: the element set will be kept as small as possible, the meanings of the elements will be kept as simple as possible in order to be understood by a wide range of users, and the element set will be flexible enough for the description of resources in a wide range of subject areas. All elements are optional, repeatable and modifiable. This makes the Dublin Core notably different from many other metadata schemes, as Miller (1996, WWW) emphasizes, because it is relatively easy to use and to interpret even for users who are not experts.

The Dublin Core Element Set in the version 1.0 as of 1997 (also named the „Finnish Finish“ after the workshop in Helsinki, Finland) consists of 15 elements which are described in detail on the Dublin Core Home Page (Internet, URL = http://Purl.org/DC/about/element_set.htm).

For Dublin Core metadata there are also numerous metadata templates available, e.g. the Dublin Core Metadata Template provided by the Nordic Metadata Project (http://www.lub.lu.se/cgi-bin/nmdc.pl?lang=en&save-info=on&simple=1). It is maintained by the library of the University of Lund, Sweden, as a free service for participants of the Nordic Metadata Project (Internet, URL = http://linnea.helsinki.fi/meta/). The template offers a minimal version of Dublin Core metadata which can be completed by adding boxes for the 15 standardized fields of the Dublin Core Element Set by clicking on the element descriptions. For our example we will list all 15 elements even though they are not all used (comments are given in parentheses).

  1. Title
    How To Use General Design Issues and Metadata In Order To Get Your Web Page Picked Up By Search Engines
    (This box contains the title of the Web page as in the HTML title tag. It is possible to add an alternative title, for example in a foreign language or a short title for easier use.)

  2. Creator
    Werner Schweibenz
    Creator Address
    w.schweibenz@rz.uni-sb.de
    (The box contains the name of the author who created the page. An optional box contains the author’s email address for communications to the author.)

  3. Subject: Keywords
    search engines, information retrieval on the Web, proactive Web design, Web site promotion, Web metadata, Dublin Core, Workshop Exploring a Communication Model for Web Design
    (The box lists index terms for search engine indexing. Most search engines refer to keywords for indexing. Each keyword has to be placed in an extra box for precise indexing. If necessary, additional boxes can be added by clicking on the plus sign next to the box. There are additional keyword boxes for Controlled vocabulary and Classification according to various cataloging rules.)

  4. Description
    Paper presented at the Second International Workshop Exploring a Communication Model for Web Design, Seattle, WA, July 10-17, 1999. The paper gives a survey on Web search services, deals with size of the Web and its coverage by search engines, gives recommendations for proactive Web design and how to promote a Web page by using general design issues and metadata.
    (The box contains a summary of the page’s content. Most search engines use the description for display on the results list. Therefore the description should give a concise and informative report comparable to an abstract in journals or online databases.)

  5. Publisher
    Department of Information Science (Fachrichtung Informationswissenschaft), University of Saarland
    (The box names the publisher of the Web page, e.g. an organization or a company. It is possible to add a box with the publisher’s email address for communication.)

  6. Contributor
    contains no data
    (The box can name a person or institution who made important contribution to the Web page. If it contains no data, the box can be deleted by clicking the minus sign next to the box.)

  7. Date
    1999-06-28
    (The box contains the date the Web page was created. The date can be created in three different standards, ISO 8601, ANSI and RFC 822. In this case the ISO standard was used.)

  8. Type
    Text.Article
    (The box offers a selection menu for describing the type of the Web page, e.g. different kinds of text and image types. As the page consists of text and is an article, the type Text.Article was selected.)


  9. Format
    text/html (.htm, .html)
    (The box offers a selection menu for describing the data format of the Web page. As the page consists of text in HTML format, the format text/html was chosen.)

  10. Identifier: URL
    http://www.phil.uni-sb.de/fr/infowiss/projekte/index.html
    (The box names the Web address of the page, the Uniform Resource Locator (URL). An additional box offers a Uniform Resource Name (URN) which allows a definite identification of a Web page independent from the URL. The URN is comparable to the ISBN of books or the ISSN for journals. Only participants of the Nordic Metadata Project can apply for a URN.)

  11. Source
    http://www.phil.uni-sb.de/fr/infowiss/projekte/index.html
    (The box contains the source of the Web page. As there exists no identification code for the Web page like an ISBN or an ISSN, the URL is used.)

  12. Language
    English
    (The box names the language the content of the Web page is written in. A selection menu allows one to choose the language.)

  13. Relation
    http://www.uwtc.washington.edu/workshop/1999/description/program.htm
    (The box states a relation of this Web page to another Web page, here the Web page of the workshop. The formal specification of relationship is currently under development.)

  14. Coverage
    contains no data
    (The box describes the temporal or spatial coverage of the Web page’s content. As there is no direct relation to a certain time or place, the box contains no data.)

  15. Rights
    Werner Schweibenz, w.schweibenz@rz.uni-sb.de
    (The box contains the copyright of the Web page. It can be free text or the URL or URN of a page that contains copyright statements. It makes sense to set up a separate HTML document which contains a copyright notice or a rights-management statement for the whole Web site or individual pages.)

The metadata are generated by clicking the button Return metadata at the bottom of the page. Different output choices are possible, e.g. for preview, for inclusion in HTML-document, and for inclusion in HTML4-document. The META tags for inclusion in HTML-document created by the template are displayed on screen:

<!-- For best results, you should include the HTML-coded -->
<!-- metadata that you find further down the page -->
<!-- (below the line) in the <HEAD></HEAD>-tag of your -->
<!-- page. This will simplify correct indexing by robots. -->
<!-- ---------------------------------------------------- -->

<META NAME="DC.Title" CONTENT="How To Use General Design Issues and Metadata In Order To Get Your Web Page Picked Up By Search Engines">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#title">

<META NAME="DC.Creator" CONTENT="Werner Schweibenz">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#creator">

<META NAME="DC.Creator.Address" CONTENT="w.schweibenz@rz.uni-sb.de">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#creator">

<META NAME="DC.Subject" CONTENT="search engines">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Subject" CONTENT="information retrieval on the Web">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Subject" CONTENT="proactive Web design">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Subject" CONTENT="Web site promotion">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Subject" CONTENT="Web metadata">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Subject" CONTENT="Dublin Core">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Subject" CONTENT="Workshop Exploring a Communication Model for Web Design">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#subject">

<META NAME="DC.Description" CONTENT="Paper presented at the Second International Workshop Exploring a Communication Model for Web Design, Seattle, WA, July 10-17, 1999. The paper gives a survey on Web search services, deals with size of the Web and its coverage by search engines, gives recommendations for proactive Web design and how to promote a Web page by using general design issues and metadata.">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#description">

<META NAME="DC.Publisher" CONTENT="Department of Information Science (Fachrichtung Informationswissenschaft), University of Saarland">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#publisher">

<META NAME="DC.Date" CONTENT="(SCHEME=ISO8601) 1999-06-29">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#date">

<META NAME="DC.Type" CONTENT="Text.Article">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#type">

<META NAME="DC.Format" CONTENT="(SCHEME=IMT) text/html">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#format">
<LINK REL=SCHEMA.imt href="http://sunsite.auc.dk/RFC/rfc/rfc2046.html">

<META NAME="DC.Identifier" CONTENT="http://www.phil.uni-sb.de/fr/infowiss/projekte/index.html">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#identifier">

<META NAME="DC.Source" CONTENT="http://www.phil.uni-sb.de/fr/infowiss/projekte/index.html">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#source">

<META NAME="DC.Language" CONTENT="(SCHEME=ISO639-1) en">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#language">

<META NAME="DC.Relation" CONTENT="(SCHEME=URL) http://www.uwtc.washington.edu/workshop/1999/description/program.htm">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#relation">

<META NAME="DC.Rights" CONTENT="Werner Schweibenz, w.schweibenz@rz.uni-sb.de">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#rights">

<META NAME="DC.Date.X-MetadataLastModified" CONTENT="(SCHEME=ISO8601) 1999-06-29">
<LINK REL=SCHEMA.dc href="http://purl.org/metadata/dublin_core_elements#date">

The metadata can be copied to the HTML file by using cut & paste. The META tags are marked as Dublin Core metadata by the abbreviation DC, and each is accompanied by a so-called LINK tag that refers to the Dublin Core home page and indicates where descriptions of the elements can be found. The last element is the date the metadata were last modified.
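
In the output above, qualifiers such as the date scheme are written in parentheses inside the CONTENT value. HTML 4 provides a SCHEME attribute on the META element for this purpose, so the variant for inclusion in an HTML4-document could plausibly look like the sketch below. This is an assumption based on the HTML 4 specification, not a copy of the template's actual output.

```html
<!-- Sketch only: the qualifier moves from the CONTENT value to the SCHEME attribute -->
<META NAME="DC.Date" SCHEME="ISO8601" CONTENT="1999-06-29">
<META NAME="DC.Format" SCHEME="IMT" CONTENT="text/html">
<META NAME="DC.Language" SCHEME="ISO639-1" CONTENT="en">
```

Keeping the scheme in its own attribute leaves the CONTENT value free of markup that a robot would otherwise have to strip before using the value.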

Practical issues of using metadata arise when the number of pages with metadata grows and a form of metadata management becomes necessary. Powell (1997, WWW) gives an overview of various issues related to the management of a huge number of Dublin Core records on a Web site. Powell favors the use of Dublin Core metadata over search engine metadata because they are standardized. But in practice one will often find that both formats are used next to each other in the header of the same Web page, as for instance in the META tags created by the Vancouver Webpages’ Meta Builder. As metadata consist of text only, there are no disadvantages in applying both sets as far as the transmission time is concerned. As far as spamming is concerned, there should be no problems with downgrading by search engines due to the use of both sets as the Dublin Core metadata are clearly marked. As long as most search engine help documentation does not state whether one format, the other, or both are used, it seems reasonable to use both sets. Another question that remains undecided is on which pages to implement metadata. As the research of Tunender and Ervin (1998, 177) shows, it takes a considerable amount of time before search engines crawl deeper into the hierarchical structure of a Web site. This suggests that it should be good enough to equip the Web pages on the first and second level with metadata and hope for the Deep Crawl function to discover the remaining levels. Nevertheless it makes sense to embed metadata in every Web page that is submitted directly to a search engine even if it is placed on a deeper level. But these can only be rules of thumb, because further research needs to be conducted on this specific topic.

Although metadata hold great promise for information retrieval on the Web, there are still problems to solve. The major problem still is that the „actual use of metadata in the head area of Web pages is still uncommon and inconsistent“, as Wheatley and Armstrong (1997, 206) rightly state. Part of the problem is that search engines do not make predictable use of metadata. This problem will hopefully soon be solved because the Dublin Core Initiative cooperates with the Internet Engineering Task Force and the World Wide Web Consortium in order to create a standard for Dublin Core metadata and integrate it into the metadata architecture of the Resource Description Framework RDF (cf. Weibel 1999, WWW). The other problem is the lack of critical mass of Web metadata. Up to now, there are not enough Web designers who take advantage of META tags. According to Clark (1999, WWW), who does not name his sources, „statistics show that only about 21 percent of Web pages use keyword and description META tags“, while the search engine company Excite estimates that only 30 to 40 percent of Web publishers use META tags (Sherman 1999, 57-58). This shows that at the moment the limiting factor for Web metadata is the critical mass of Web pages with metadata (for the problem of the critical mass and a possible solution by back-propagation see Marchiori 1998).


7 Conclusions

As the recommendations for proactive Web design show, there are ways to promote Web pages by use of general design issues and metadata. These recommendations are no guarantee that individual search engines will pick up a Web page and rank it high, as too little is known about the way individual search engines work. For the majority of search engines they will work reasonably well because they are based on the findings of various experiments. What is important is to include these measures at an early stage of the design process and to keep retrieval techniques in mind while designing the individual Web pages. In order to make more sophisticated recommendations, further research is necessary, as Tunender and Ervin (1998, 178) rightly emphasize.

As for metadata, it looks as if the Dublin Core Element Set will become a lingua franca for metadata, as Milstead and Feldman (1999, WWW) presume. As the Dublin Core will be integrated in the Resource Description Framework, which is under construction, Dublin Core metadata have great potential for improving information retrieval on the Web. What is still necessary for a significant improvement of Web searching is to encourage more Web authors to take advantage of metadata.


Acknowledgements

The author would like to thank James Andrews and David Moxley, both with the School of Information Science and Learning Technology, University of Missouri-Columbia, for sharing their knowledge of search engines, and the Verein der Freunde der Universität des Saarlandes for sponsoring his participation in the workshop.


References
