Google Search Appliance Documentation

Administering Crawl

Advanced Topics

Crawling is the process by which the Google Search Appliance discovers enterprise content to index. The information in this chapter extends beyond basic crawling.

Identifying the User Agent

Web servers see various client applications, including Web browsers and the Google Search Appliance crawler, as “user agents.” When the search appliance crawler visits a Web server, the crawler identifies itself to the server by its User-Agent identifier, which is sent as part of the HTTP request.

The User-Agent identifier includes all of the following elements:

User Agent Name

The default user agent name for the Google Search Appliance is “gsa-crawler.” In a Web server’s logs, the server administrator can identify each visit by the search appliance crawler to a Web server by this user agent name.

You can view or change the User-Agent name, or enter additional HTTP headers for the search appliance crawler to use, on the Content Sources > Web Crawl > HTTP Headers page in the Admin Console.

User Agent Email Address

Including an email address in the User-Agent identifier enables a webmaster to contact the Google Search Appliance administrator in case the site is adversely affected by crawling that is too rapid, or if the webmaster does not want certain pages crawled at all. The email address is a required element of the search appliance User-Agent identifier.
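
For illustration, a crawl request recorded in a Web server’s access log might include a User-Agent header like the following (the appliance identifier and email address shown here are placeholders; the exact format depends on your configuration):

User-Agent: gsa-crawler (Enterprise; T4-ABCDEFGHIJKLM; search-admin@example.com)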

For complete information about the Content Sources > Web Crawl > HTTP Headers page, click Admin Console Help > Content Sources > Web Crawl > HTTP Headers in the Admin Console.

Coverage Tuning

You can control the number of URLs the search appliance crawls for a site by using the Content Sources > Web Crawl > Coverage Tuning page in the Admin Console. To tune crawl coverage, specify a URL pattern and set the maximum number of URLs to crawl for it. The URL patterns you provide must conform to the Rules for Valid URL Patterns in Administering Crawl.
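
For example, you might cap crawling of a large archive site as follows (hypothetical values; the exact entry format is described in the Admin Console Help):

URL pattern:           www.example.com/archive/
Maximum URLs to crawl: 10000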

For complete information about the Content Sources > Web Crawl > Coverage Tuning page, click Admin Console Help > Content Sources > Web Crawl > Coverage Tuning in the Admin Console.

Freshness Tuning

You can improve the performance of a continuous crawl by using URL patterns on the Content Sources > Web Crawl > Freshness Tuning page in the Admin Console. The page provides four categories of crawl behavior, described below. To apply a crawl behavior, specify URL patterns for that behavior.

Crawl Frequently

Use Crawl Frequently patterns for URLs that are dynamic and change frequently. You can use the Crawl Frequently patterns to give hints to the search appliance crawler during the early stages of crawling, before the search appliance has a history of how frequently URLs actually change.

Any URL that matches one of the Crawl Frequently patterns is scheduled to be recrawled at least once every day. The minimum wait time (see Wait Times) is 15 minutes, but if you have too many URLs in Crawl Frequently patterns, wait time increases.

Crawl Infrequently

Use Crawl Infrequently patterns for URLs that are relatively static and do not change frequently. Any URL that matches one of the Crawl Infrequently patterns is not crawled more than once every 90 days, regardless of its Enterprise PageRank or how frequently it changes. You can use this feature for Web pages that do not change and do not need to be recrawled. You can also use it for Web pages where a small part of the content changes frequently but the important parts do not.

Always Force Recrawl

Use Always Force Recrawl patterns to prevent the search appliance from crawling a URL from cache (see Determining Document Changes with If-Modified-Since Headers and the Content Checksum).

Recrawl these URL Patterns

Use Recrawl these URL Patterns to submit a URL to be recrawled. URLs that you enter here are recrawled as soon as possible.

For complete information about the Content Sources > Web Crawl > Freshness Tuning page, click Admin Console Help > Content Sources > Web Crawl > Freshness Tuning in the Admin Console.

Changing the Amount of Each Document that Is Indexed

By default, the search appliance indexes up to 2.5MB of each text or HTML document, including documents that have been truncated or converted to HTML. After indexing, the search appliance caches the indexed portion of the document and discards the rest.

You can change the default by entering a new amount of up to 10MB in Index Limits on the Index > Index Settings page.

For complete information about changing index settings on this page, click Admin Console Help > Index > Index Settings in the Admin Console.

Configuring Metadata Indexing

The search appliance has default settings for indexing metadata, including which metadata names are to be indexed, as well as how to handle multivalued metadata and date fields. You can customize the default settings or add an indexing configuration for a specific attribute by using the Index > Index Settings page. By using this page, you can perform the following tasks, each described in this section:

- Include or exclude metadata names
- Specify multivalued separators
- Specify a date format for metadata date fields

For complete information about configuring metadata indexing, click Admin Console Help > Index > Index Settings in the Admin Console.

Including or Excluding Metadata Names

You might know which indexed metadata names you want to use in dynamic navigation. In this case, you can create a whitelist of names to be used by entering an RE2 regular expression that includes those names in Regular Expression and checking Include.

If you know which indexed metadata names you do not want to use in dynamic navigation, you can create a blacklist of names by entering an RE2 regular expression that includes those names in Regular Expression and selecting Exclude. Although blacklisted names do not appear in dynamic navigation options, these names are still indexed and can be searched by using the inmeta, requiredfields, and partialfields query parameters.

This option is required for dynamic navigation. For information about dynamic navigation, click Admin Console Help > Search > Search Features > Dynamic Navigation.

By default, the regular expression is ".*" and Include is selected; that is, all metadata names are indexed and all of them are used in dynamic navigation.
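
For example, to use only a few specific metadata names in dynamic navigation (hypothetical attribute names), you might enter an RE2 regular expression such as the following in Regular Expression and select Include:

(author|department|doctype)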

For complete information about creating a whitelist or blacklist of metadata names, click Admin Console Help > Index > Index Settings in the Admin Console.

Specifying Multivalued Separators

A metadata attribute can have multiple values, indicated either by multiple meta tags or by multiple values within a single meta tag, as shown in the following example:

<meta name="authors" content="S. Jones, A. Garcia">

In this example, the two values (S. Jones, A. Garcia) are separated by a comma.

By using the Multivalued Separator options, you can specify multivalued separators for the default metadata indexing configuration or for a specific metadata name. Any string except an empty string is a valid multivalued separator. An empty string causes the multiple values to be treated as a single value.
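
For example, given the meta tag shown above, specifying a comma as the multivalued separator for the authors attribute splits the content into two values, while an empty string leaves it as one (an illustrative sketch of the behavior described above):

<meta name="authors" content="S. Jones, A. Garcia">
Separator ","   -> two values: S. Jones | A. Garcia
Empty separator -> one value:  S. Jones, A. Garcia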

For complete information about specifying multivalued separators, click Admin Console Help > Index > Index Settings in the Admin Console.

Specifying a Date Format for Metadata Date Fields

By using the Date Format menus, you can specify a date format for metadata date fields. The following example shows a date field:

<meta name="releasedOn" content="20120714">

To specify a date format for either the default metadata indexing configuration or for a specific metadata name, select a value from the menu.

The search appliance tries to parse dates that it discovers according to the format that you select for a specific configuration or, if you do not add a specific configuration, the default date format. If the date that the search appliance discovers in the metadata is not in the selected format, the search appliance attempts to parse it as any other date format that it recognizes.
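
For example, if the format you select corresponds to YYYYMMDD, the releasedOn value shown above is parsed as follows (illustrative only; the available formats are listed in the menu):

<meta name="releasedOn" content="20120714">  -> July 14, 2012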

For complete information about specifying a date format, click Admin Console Help > Index > Index Settings in the Admin Console.

Crawling over Proxy Servers

If you want the Google Search Appliance to crawl outside your internal network and include the crawled data in your index, use the Content Sources > Web Crawl > Proxy Servers page in the Admin Console. For complete information about the Content Sources > Web Crawl > Proxy Servers page, click Admin Console Help > Content Sources > Web Crawl > Proxy Servers in the Admin Console.

Preventing Crawling of Duplicate Hosts

Many organizations have mirrored servers or duplicate hosts for such purposes as production, testing, and load balancing. Mirroring also occurs when multiple aliases are in use or when a Web site has changed its name, which usually happens when companies or departments merge.

Allowing the Google Search Appliance to crawl content on mirrored servers has disadvantages, such as wasted crawling capacity and duplicate documents in the index.

To prevent crawling of duplicate hosts, you can specify one or more “canonical,” or standard, hosts using the Content Sources > Web Crawl > Duplicate Hosts page.
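
For example, you might declare one host as canonical for its mirrors (hypothetical host names):

Canonical host:  www.example.com
Duplicate hosts: www2.example.com, mirror.example.com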

For complete information about the Content Sources > Web Crawl > Duplicate Hosts page, click Admin Console Help > Content Sources > Web Crawl > Duplicate Hosts in the Admin Console.

Enabling Infinite Space Detection

In “infinite space,” the search appliance repeatedly crawls similar URLs with the same content while useful content goes uncrawled. For example, the search appliance might start crawling infinite space if a page that it fetches contains a link back to itself with a different URL. The search appliance keeps crawling this page because, each time, the URL contains progressively more query parameters or a longer path. When a URL is in infinite space, the search appliance does not crawl links in the content.
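
For example, a page containing a relative link back to itself can generate an ever-lengthening series of URLs (hypothetical URLs):

http://www.example.com/calendar/
http://www.example.com/calendar/calendar/
http://www.example.com/calendar/calendar/calendar/
...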

By enabling infinite space detection, you can prevent crawling of duplicate content to avoid infinite space indexing.

To enable infinite space detection, use the Content Sources > Web Crawl > Duplicate Hosts page.

For complete information about the Content Sources > Web Crawl > Duplicate Hosts page, click Admin Console Help > Content Sources > Web Crawl > Duplicate Hosts in the Admin Console.

Configuring Web Server Host Load Schedules

A Web server can handle several concurrent requests from the search appliance. The number of concurrent requests is known as the Web server’s “host load.” If the Google Search Appliance is crawling through a proxy, the host load limits the maximum number of concurrent connections that can be made through the proxy. The default number of concurrent requests is 4.0.

Increasing the host load can speed up the crawl rate, but it also puts more load on your Web servers. It is recommended that you experiment with host load settings at off-peak times or in controlled environments so that you can monitor the effect they have on your Web servers.

To configure a Web Server Host Load schedule, use the Content Sources > Web Crawl > Host Load Schedule page. You can also use this page to configure exceptions to the web server host load.

Note the following about file system crawling: if you have configured the search appliance to crawl documents from an SMB file system, the crawler follows only the default Web Server Host Load value (4.0 by default); it does not apply Exceptions to Web Server Host Load entries for the SMB host. Because of a design constraint, the default Web Server Host Load value should be set to 8.0 or below, or file system crawling performance may suffer.
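
For example, you might keep the default host load and add an exception that allows a more robust server to take more concurrent requests during off-peak hours (hypothetical values; the exact entry format is described in the Admin Console Help):

Default Web Server Host Load: 4.0
Exception: www.example.com, 22:00-06:00, host load 8.0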

For complete information about the Content Sources > Web Crawl > Host Load Schedule page, click Admin Console Help > Content Sources > Web Crawl > Host Load Schedule in the Admin Console.

Removing Documents from the Index

To remove a document from the index, add the full URL of the document to Do Not Follow Patterns on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
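
For example, to remove a single document (hypothetical URL), add its full URL to Do Not Follow Patterns:

http://www.example.com/obsolete/old-page.html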

Using Collections

Collections are subsets of the index used to serve different search results to different users. For example, a collection can be organized by geography, product, job function, and so on. Collections can overlap, so one document can be relevant to several different collections, depending on its content. Collections also allow users to search targeted content more quickly and efficiently than searching the entire index.
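
For example, overlapping collections might be defined by URL patterns such as the following (hypothetical patterns):

engineering: www.example.com/eng/
policies:    www.example.com/policies/
everything:  www.example.com/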

For information about using the Index > Collections page to create and manage collections, click Admin Console Help > Index > Collections in the Admin Console.

Default Collection

During initial crawling, the Google Search Appliance establishes the default_collection, which contains all crawled content. You can redefine the default_collection, but doing so is not advisable because index diagnostics are organized by collection. Troubleshooting with the Index > Diagnostics > Index Diagnostics page becomes much harder if you cannot see all crawled URLs.

Changing URL Patterns in a Collection

Documents that are added to the index receive a tag for each collection whose URL patterns they match. If you change the URL patterns for a collection, the search appliance immediately starts a process that runs across all the crawled URLs and retags them according to the change in the URL patterns. This process usually completes in a few minutes but can take up to an hour for heavily loaded appliances. Search results for the collection are corrected after the process finishes.

JavaScript Crawling

The search appliance supports JavaScript crawling and can detect links and content generated dynamically through JavaScript execution. The search appliance supports dynamic link and content detection in the following situations, each described later in this section:

- Logical redirects by assignments to window.location
- Links and content added by document.write and document.writeln functions
- Links that are generated by event handlers
- Links that are JavaScript pseudo-URLs
- Links with an onclick return value other than false

If your enterprise content relies on URLs generated by JavaScript that is not covered by any of these situations, use jump pages or basic HTML site maps to force crawling of those URLs.

The search appliance only executes scripts embedded inside a document; it does not fetch or execute externally referenced script files.

Also, if the search appliance finds an error while parsing JavaScript, or if the JavaScript contains an error, the search appliance might fail to find links that require functions below the error. In this instance, anything below the error might be discarded.
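
The following sketch (hypothetical URLs) illustrates this behavior: the unbalanced parenthesis in the first statement is a JavaScript error, so the link written by the second statement might never be discovered.

<HTML>
  <HEAD>
    <SCRIPT type='text/javascript'>
      document.write('<a href="http://foo.google.com/links/ok.html">found</a>'; // error: unbalanced parentheses
      document.write('<a href="http://foo.google.com/links/missed.html">may be discarded</a>');
    </SCRIPT>
  </HEAD>
  <BODY></BODY>
</HTML>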

Logical Redirects by Assignments to window.location

The search appliance crawls links specified by a logical redirect by assignment to window.location, which makes the web browser load a new document by using a specific URL.

The following code example shows a logical redirect by assignment to window.location.

<HTML>
  <HEAD>
    <SCRIPT type='text/javascript'>
      var hostName = window.location.hostname;
      var u = "http://" + hostName + "/links" + "/link1.html";
      window.location.replace(u);
    </SCRIPT>
  </HEAD>
  <BODY></BODY>
</HTML>

Links and Content Added by document.write and document.writeln Functions

The search appliance crawls links and indexes content that is added to a document by document.write and document.writeln functions. These functions generate document content while the document is being parsed by the browser.

The following code example shows links added to a document by document.write.

<HTML>
  <HEAD>
    <SCRIPT type='text/javascript'>
      document.write('<a href="http://foo.google.com/links/'
      + 'link2.html">link2</a>');
      document.write(
      '<script>document.write(\'<a href="http://foo.google.com/links/'
      + 'link3.html">script within a script</a>\');<\/script>');
    </SCRIPT>
  </HEAD>
  <BODY></BODY>
</HTML>

Links that are Generated by Event Handlers

The search appliance crawls links that are generated by event handlers, such as onclick and onsubmit.

The following code example shows links generated by event handlers in an anchor and a div tag.

<HTML>
  <HEAD>
    <SCRIPT type='text/javascript'>
      function openlink(id) {
        window.location.href = "/links/link" + id + ".html";
      }
    </SCRIPT>
  </HEAD>
  <BODY>
    <a onclick="openlink('4');" href="#">attribute anchor 1</a>
    <div onclick="openlink('5');">attribute anchor 2</div>
  </BODY>
</HTML>

Links that are JavaScript Pseudo-URLs

The search appliance crawls links that include JavaScript code and use the javascript: pseudo-protocol specifier.

The following code example shows a link that is a JavaScript pseudo-URL.

<HTML>
  <HEAD>
    <SCRIPT type='text/javascript'>
      function openlink(id) {
        window.location.href = "/links/link" + id + ".html";
      }
    </SCRIPT>
  </HEAD>
  <BODY>
    <a href="javascript:openlink('6')">JavaScript URL</a>
  </BODY>
</HTML>

Links with an onclick Return Value

The search appliance crawls links with an onclick return value other than false. If the onclick script returns false, the URL is not crawled. The following code example shows both situations.

<HTML> 
  <HEAD></HEAD> 
  <BODY> 
   <a href="http://bad.com" onclick="return false;">This link will not be crawled</a> 
   <a href="http://good.com" onclick="return true;">This link will be crawled</a> 
  </BODY> 
</HTML>

Indexing Content Added by document.write/writeln Calls

Any content added to the document by document.write or document.writeln calls (as shown in the following example) is indexed as part of the original document.

<HTML>
  <HEAD>
    <SCRIPT type='text/javascript'>
      document.write('<P>This text will be indexed.</P>');
    </SCRIPT>
  </HEAD>
  <BODY></BODY>
</HTML>

Discovering and Indexing Entities

Entity recognition is a feature that enables the Google Search Appliance to discover interesting entities in documents with missing or poor metadata and store these entities in the search index.

For example, suppose that your search appliance crawls and indexes multiple content sources, but only one of these sources has robust metadata. By using entity recognition, you can enrich the metadata-poor content sources with discovered entities and discover new, interesting entities in the source with robust metadata.

After you configure and enable entity recognition, the search appliance automatically discovers specific entities in your content sources during indexing, annotates them, and stores them in the index. Once the entities are indexed, you can enhance keyword search by adding the entities in dynamic navigation, which uses metadata in documents and entities discovered by entity recognition to enable users to browse search results by using specific attributes. To add the entities to dynamic navigation, use the Search > Search Features > Dynamic Navigation page.

Additionally, by default, entity recognition extracts and stores full URLs in the index. This includes both document URLs and plain text URLs that appear in documents. So you can match specific URLs with entity recognition and add them to dynamic navigation, enabling users to browse search results by full or partial URL. For details about this scenario, see Use Case: Matching URLs for Dynamic Navigation.

The Index > Entity Recognition page enables you to specify the entities that you want the search appliance to discover in your documents. If you want to identify terms that should not be stored in the index, you can upload the terms in an entity blacklist file.

Creating Dictionaries and Composite Entities

Before you can specify entities on the Index > Entity Recognition page, you must define each entity by creating dictionaries of terms and regular expressions. Dictionaries are required for entity recognition: they enable the search appliance to discover specific entities in the content and annotate them as entities.

Generally, with dictionaries, you define an entity with lists of terms and regular expressions. For example, the entity "Capital" might be defined by a dictionary that contains a list of country capitals: Abu Dhabi, Abuja, Accra, Addis Ababa, and so on. After you create a dictionary, you can upload it to the search appliance.

Entity recognition accepts dictionaries in either TXT or XML format.
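
For example, a minimal TXT dictionary for the "Capital" entity might list one term per line (a sketch; the exact file requirements are described in the Admin Console Help):

Abu Dhabi
Abuja
Accra
Addis Ababa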

Optionally, you can also create composite entities that run on the annotated terms. Like dictionaries, composite entities define entities, but composite entities enable the search appliance to discover more complex terms. In a composite entity, you can define an entity with a sequence of terms. Because composite entities run on annotated terms, all the words in a sequence must be tagged with an entity and so depend on dictionaries.

For example, suppose that you want to define a composite entity that detects full names, that is, combinations of titles, names, middle names, and surnames. First, you need to define four dictionary-based entities, Title, Name, Middlename, and Surname, and provide a dictionary for each one. Then you define the composite entity, FullName, which detects full names.

A composite entity is written as an LL(1) grammar.
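
As an illustrative sketch only (the exact grammar syntax is described in the Admin Console Help), the FullName composite entity might be expressed as a production over the four dictionary-based entities:

FullName ::= Title Name Middlename Surname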

The search appliance provides sample dictionaries and composite entities, as shown on the Index > Entity Recognition page.

Setting Up Entity Recognition

Google recommends that you perform the following tasks to set up entity recognition, in this order:

1. Create dictionaries that define your entities and, optionally, composite entities that build on them.
2. Upload the dictionaries and add the entities by using the Index > Entity Recognition page.
3. Test the configuration by using entity diagnostics.
4. Add the entities to dynamic navigation by using the Search > Search Features > Dynamic Navigation page.

For complete information about setting up entity recognition, click Admin Console Help > Index > Entity Recognition in the Admin Console.

Use Case: Matching URLs for Dynamic Navigation

This use case describes matching URLs with entity recognition and using them to enrich dynamic navigation options. It also shows you how to define the name of the dynamic navigation options that display, either by explicitly specifying the name or by capturing the name from the URL.

This use case assumes you have already enabled entity recognition on your GSA and added entities to dynamic navigation. Having seen how easy this feature makes browsing results, your users also want to be able to browse by URLs. These URLs include:

http://www.mycompany.com/services/...
http://www.mycompany.com/policies/...
http://www.mycompany.com/history/...

They want dynamic navigation results to include just the categories “services,” “policies,” and so on. You can achieve this goal by performing the steps described in the following sections:

Creating an XML Dictionary that Defines an Entity for Matching URLs

The following example shows an XML dictionary for entity recognition that matches URLs. In this example, the names displayed for the dynamic navigation options are defined using the name element:

<?xml version="1.0"?>
<instances> 
  <instance> 
    <name>services</name> 
    <pattern>http://.*/services/.*</pattern> 
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
  <instance>
    <name>policies</name> 
    <pattern>http://.*/policies/.*</pattern>
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
  <instance>
    <name>history</name> 
    <pattern>http://.*/history/.*</pattern> 
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
</instances>

Note: You must create an instance for each type of URL that you want to match.

Creating an XML Dictionary that Defines an Entity for Capturing the Name from the URL

The following example shows an XML dictionary that matches URLs and captures the name of the dynamic navigation option by using a capture group in the regular expression pattern:

<?xml version="1.0"?>
<instances>
  <instance>
    <name> Anything - will not be used </name>
    <pattern> http://www.mycompany.com/(\w+)/[^\s]+ </pattern>
    <store_regex_or_name> regex_tagged_as_first_group </store_regex_or_name>
  </instance>
</instances>

There are two important things to note about this example:

- The content of the name element is not used, because store_regex_or_name is set to regex_tagged_as_first_group; the displayed name is taken from the URL instead.
- The first parenthesized group in the pattern, (\w+), captures the part of the URL (for example, “services”) that is stored and displayed as the dynamic navigation option.

Adding the Entity to Entity Recognition

Add a new entity, which is defined by the dictionary:

1. Click Index > Entity Recognition > Simple Entities.
2. On the Simple Entities tab, enter the name of the entity in the Entity name field, for example “type-of-doc.”
3. Click Choose File to navigate to the dictionary file in its location and select it.
4. Under Case sensitive?, click Yes.
5. Under Transient?, click No.
6. Click Upload.
7. (Optional) Click Entity Diagnostics to test that everything works.

Adding the Entity to Dynamic Navigation

To show URLs as dynamic navigation options, add the entity:

1. Click Search > Search Features > Dynamic Navigation.
2. Under Existing Configurations, click Add.
3. In the Name box, type a name for the new configuration, for example “domains.”
4. Under Attributes, click Add Entity.
5. In the Display Label box, enter the name you want to appear in the search results, for example “TypeOfUrl.” This name can be different from the name of the entity.
6. From the Attribute Name drop-down menu, select the name of the entity that you created, for example “type-of-doc.”
7. From the Type drop-down menu, select STRING.

Viewing URLs in the Search Results

After you perform the steps described in the preceding sections, your users can view URLs in the dynamic navigation options.

Note that dynamic navigation only displays the entities of the documents in the result set (the first 30K documents). If documents that contain entities are not in the result set, their entities are not displayed.

Note, however, that entity recognition runs only on documents that are added to the index after you enable entity recognition. Documents already in the index are not affected. To run entity recognition on documents that are already in the index, force the search appliance to recrawl their URL patterns by using the Index > Diagnostics > Index Diagnostics page.

Use Case: Testing Entity Recognition for Non-HTML Documents

This use case describes how you can test your entity recognition configuration on an indexed document that is not in HTML format. To run entity diagnostics on HTML documents, use the Index > Entity Recognition > Entity Diagnostics page in the Admin Console.

Testing Entity Recognition on a Cached Non-HTML Document

Note: This procedure does not affect the crawl and indexing of a URL.

1. Click Index > Diagnostics > Index Diagnostics.
2. Click List format.
3. Under All hosts, click the URL of the document that you want to test.
4. Under More information about this page, click Open in entity diagnostics.

Note: The Open in entity diagnostics link is only available for public documents.

Wildcard Indexing

Wildcard search enables your users to enter queries that contain substitution patterns rather than exact spellings of terms. Wildcard indexing makes words in your content available for wildcard search.
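
For example, with wildcard indexing enabled, a user can type the following query to match terms such as “program,” “programs,” and “programming”:

program*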

To enable or disable wildcard indexing, or to change the type of wildcard indexing, use the Index > Index Settings page in the Admin Console. For more information about wildcard indexing, click Admin Console Help > Index > Index Settings.

By default, wildcard search is enabled for each front end of the search appliance. You can disable or enable wildcard search for one or more front ends by using the Filters tab of the Search > Search Features > Front Ends page. Note that wildcard search is not supported for Chinese, Japanese, Korean, or Thai. For more information about wildcard search, click Admin Console Help > Search > Search Features > Front Ends > Filters.