Google Search Appliance Documentation

Administering Crawl

Monitoring and Troubleshooting Crawls

Crawling is the process where the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators how to monitor a crawl. It also describes how to troubleshoot some common problems that may occur during a crawl.

Using the Admin Console to Monitor a Crawl

The Admin console provides Reports pages that enable you to monitor crawling. The following table describes monitoring tasks that you can perform using these pages.

Monitor crawling status

Content Sources > Diagnostics > Crawl Status

While the Google Search Appliance is crawling, you can view summary information about events of the past 24 hours using the Content Sources > Diagnostics > Crawl Status page.

You can also use this page to stop a scheduled crawl, or to pause or restart a continuous crawl (see Stopping, Pausing, or Resuming a Crawl).

Monitor crawl history

Index > Diagnostics > Index Diagnostics

While the Google Search Appliance is crawling, you can view its history using the Index > Diagnostics > Index Diagnostics page. Index diagnostics, as well as search logs and search reports, are organized by collection (see Using Collections).

When the Index > Diagnostics > Index Diagnostics page first appears, it shows the crawl history for the current domain. It shows each URL that has been fetched and timestamps for the last 10 fetches. If the fetch was not successful, an error message is also listed.

From the domain level, you can navigate to lower levels that show the history for a particular host, directory, or URL. At each level, the Index > Diagnostics > Index Diagnostics page displays information that is pertinent to the selected level.

At the URL level, the Index > Diagnostics > Index Diagnostics page shows summary information as well as a detailed Crawl History.

You can also use this page to submit a URL for recrawl (see Submitting a URL to Be Recrawled).

Take a snapshot of the crawl queue

Content Sources > Diagnostics > Crawl Queue

Any time while the Google Search Appliance is crawling, you can define and view a snapshot of the queue using the Content Sources > Diagnostics > Crawl Queue page. A crawl queue snapshot displays URLs that are waiting to be crawled, as of the moment of the snapshot.

For each URL, the snapshot shows:

View information about crawled files

Index > Diagnostics > Content Statistics

At any time while the Google Search Appliance is crawling, you can view summary information about files that have been crawled using the Index > Diagnostics > Content Statistics page. You can also use this page to export the summary information to a comma-separated values file.

Crawl Status Messages

In the Crawl History for a specific URL on the Index > Diagnostics > Index Diagnostics page, the Crawl Status column lists various messages, as described in the following table.

Crawled: New Document

The Google Search Appliance successfully fetched this URL.

Crawled: Cached Version

The Google Search Appliance crawled the cached version of the document. The search appliance sent an if-modified-since field in the HTTP header in its request and received a 304 response, indicating that the document is unchanged since the last crawl.

Retrying URL: Connection Timed Out

The Google Search Appliance set up a connection to the Web server and sent its request, but the Web server did not respond, or the HTTP transaction did not complete, within three minutes.

Retrying URL: Host Unreachable while trying to fetch robots.txt

The Google Search Appliance could not connect to a Web server when trying to fetch robots.txt.

Retrying URL: Network unreachable during fetch

The Google Search Appliance could not connect to a Web server due to a networking issue.

Retrying URL: Received 500 server error

The Google Search Appliance received a 500 status message from the Web server, indicating that there was an internal error on the server.

Excluded: Document not found (404)

The Google Search Appliance did not successfully fetch this URL. The Web server responded with a 404 status, which indicates that the document was not found. If a URL gets a status 404 when it is recrawled, it is removed from the index within 30 minutes.

Cookie Server Failed

The Google Search Appliance did not successfully fetch a cookie using the cookie rule. Before crawling any Web pages that match patterns defined for Forms Authentication, the search appliance executes the cookie rules.

Error: Permanent DNS failure

The Google Search Appliance cannot resolve the host. A possible cause is a change in your DNS servers while the search appliance is still trying to access the previously cached IP address.

The crawler caches the results of DNS queries for a long time regardless of the TTL values specified in the DNS response. A workaround is to save and then revert a pattern change on the Content Sources > Web Crawl > Proxy Servers page. Saving changes here causes internal processes to restart and flush out the DNS cache.
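
If you see this error, you can check how the host name currently resolves by querying DNS from a machine on the same network as the search appliance. The commands below are a sketch; the host name and DNS server are placeholders for your own values.

# Query the DNS server directly for the host's address.
dig @dns1.example.com +short www.myserver.example.com

# Or, if dig is not available:
nslookup www.myserver.example.com dns1.example.com

If the name resolves correctly here but the search appliance still reports the error, the stale entry is most likely in the search appliance's DNS cache, and the workaround described above applies.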

Network Connectivity Test of Start URLs Failed

When crawling, the Google Search Appliance tests network connectivity by attempting to fetch every start URL every 30 minutes. If fewer than 10% of the start URLs return OK responses, the search appliance assumes that there are network connectivity issues with a content server, slows down or stops the crawl, and displays the following message: “Crawl has stopped because network connectivity test of Start URLs failed.” The crawl restarts when the start URL connectivity test returns an HTTP 200 response.
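
To check the start URLs yourself, you can request each one from a machine on the same network as the search appliance and review the status codes returned. The sketch below assumes a file named start_urls.txt that lists one start URL per line; the file name and URLs are placeholders.

# Print the HTTP status code returned for each start URL.
while read url; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$code $url"
done < start_urls.txt

If most of the start URLs return a code other than 200, investigate the content servers and the network path between them and the search appliance.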

Slow Crawl Rate

The Content Sources > Diagnostics > Crawl Status page in the Admin Console displays the Current Crawling Rate, which is the number of URLs being crawled per second. Slow crawling may be caused by the following factors:

Non-HTML content
Complex content
Host load
Network problems
Slow Web servers
Query load

These factors are described in the following sections.

Non-HTML Content

The Google Search Appliance converts non-HTML documents, such as PDF files and Microsoft Office documents, to HTML before indexing them. This is a CPU-intensive process that can take up to five seconds per document. If more than 100 documents are queued up for conversion to HTML, the search appliance stops fetching more URLs.

You can see the HTML that is produced by this process by clicking the cached link for a document in the search results.

If the search appliance is crawling a single UNIX/Linux Web server, you can run the tail command-line utility on the server access logs to see what was recently crawled. The tail utility copies the last part of a file. You can also run the tcpdump command to create a dump of network traffic that you can use to analyze a crawl.
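
For example, assuming an Apache-style access log at /var/log/apache2/access.log (a placeholder for your server's actual log location) and the default search appliance user agent, which typically identifies itself as gsa-crawler, you might run commands like the following:

# Watch requests from the search appliance crawler arrive in real time.
tail -f /var/log/apache2/access.log | grep -i gsa-crawler

# Capture crawl traffic for later analysis; replace 10.0.0.5 with the search appliance's IP address.
tcpdump -n -w crawl.pcap host 10.0.0.5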

If the search appliance is crawling multiple Web servers, it can crawl through a proxy.

Complex Content

Crawling many complex documents can cause a slow crawl rate.

To ensure that static complex documents are not recrawled as often as dynamic documents, add the URL patterns to the Crawl Infrequently URLs on the Content Sources > Web Crawl > Freshness Tuning page (see Freshness Tuning).

Host Load

If the Google Search Appliance crawler receives many temporary server errors (500 status codes) when crawling a host, crawling slows down.

To speed up crawling, you may need to increase the number of concurrent connections to the Web server by using the Content Sources > Web Crawl > Host Load Schedule page (see Configuring Web Server Host Load Schedules).

Network Problems

Network problems, such as latency, packet loss, or reduced bandwidth, can be caused by several factors, including:

To find out what is causing a network problem, you can run tests from a device on the same network as the search appliance.

Use the wget program (available on most operating systems) to retrieve some large files from the Web server, with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network problems.
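
For example, you might time the same download once while crawling is running and again while crawling is paused; the URL below is a placeholder for a large file on your own Web server.

# Time the download of a large file; repeat with crawling paused and compare.
time wget -O /dev/null http://myserver.example.com/files/large-file.pdf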

Run the traceroute network tool from a device on the same network as the search appliance and the Web server. If your network does not permit Internet Control Message Protocol (ICMP), then you can use tcptraceroute. You should run the traceroute with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network performance problems.
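
The commands below sketch this test; replace the host name with your own Web server, and give tcptraceroute the port that the Web server listens on.

traceroute myserver.example.com

# If ICMP is blocked, trace the route using TCP packets to port 80 instead.
tcptraceroute myserver.example.com 80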

Packet loss is another indicator of a problem. You can narrow down the network hop that is causing the problem by seeing if there is a jump in the times taken at one point on the route.

Slow Web Servers

If response times are slow, you may have a slow Web server. To find out if your Web server is slow, use the wget command to retrieve some large files from the Web server. If it takes approximately the same time using wget as it does while crawling, you may have a slow Web server.

You can also log in to a Web server to determine whether there are any internal bottlenecks.

If you have a slow host, the search appliance crawler fetches lower-priority URLs from other hosts while continuing to crawl the slower host.

Query Load

The crawl processes on the search appliance are run at a lower priority than the processes that serve results. If the search appliance is heavily loaded serving search queries, the crawl rate drops.

Wait Times

During continuous crawling, you may find that the Google Search Appliance is not recrawling URLs as quickly as specified by scheduled crawl times in the crawl queue snapshot. The amount of time that a URL has been in the crawl queue past its scheduled recrawl time is the URL’s “wait time.”

Wait times can occur when your enterprise content includes:

If the search appliance crawler needs four hours to catch up to the URLs in the crawl queue whose scheduled crawl time has already passed, the wait time for crawling the URLs is four hours. In extreme cases, wait times can be several days. The search appliance cannot recrawl a URL more frequently than the wait time.

It is not possible for an administrator to view the maximum wait time for URLs in the crawl queue or to view the number of URLs in the queue whose scheduled crawl time has passed. However, you can use the Content Sources > Diagnostics > Crawl Queue page to create a crawl queue snapshot, which shows:

Errors from Web Servers

If the Google Search Appliance receives an error when fetching a URL, it records the error in Index > Diagnostics > Index Diagnostics. By default, the search appliance takes action based on whether the error is permanent or temporary:

You can either use the search appliance default settings for index removal and backoff intervals, or configure the following options for the selected error state:

Immediate Index Removal—Select this option to immediately remove the URL from the index
Number of Failures for Index Removal—Use this option to specify the number of times the search appliance is to retry fetching a URL
Successive Backoff Intervals (hours)—Use this option to specify the number of hours between backoff intervals

To configure settings, use the options in the Configure Backoff Retries and Remove Index Information section of the Content Sources > Web Crawl > Crawl Schedule page in the Admin Console. For more information about configuring settings, click Admin Console Help > Content Sources > Web Crawl > Crawl Schedule.

The following table lists permanent and temporary Web server errors. For detailed information about HTTP status codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.

Status code   Error type   Description
301           Permanent    Redirect, URL moved permanently.
302           Temporary    Redirect, URL moved temporarily.
401           Temporary    Authentication required.
404           Temporary    Document not found. URLs that get a 404 status response when they are recrawled are removed from the index within 30 minutes.
500           Temporary    Temporary server error.
501           Permanent    Not implemented.
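
To see which status code a Web server returns for a particular URL, you can request only the response headers from a machine on the same network as the search appliance; the URL below is a placeholder.

# Fetch only the response headers; the first line shows the HTTP status code.
curl -sI http://myserver.example.com/some-document.html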

In addition, the search appliance crawler refrains from visiting Web pages that have noindex and nofollow Robots META tags. For URLs excluded by Robots META tags, the maximum retry interval is one month.
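
To check whether a particular page carries these tags, you can fetch it and search its markup; the URL below is a placeholder.

# Show any robots META tags in the page source.
curl -s http://myserver.example.com/page.html | grep -io '<meta[^>]*robots[^>]*>'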

You can view errors for a specific URL in the Crawl Status column on the Index > Diagnostics > Index Diagnostics page.

URL Moved Permanently Redirect (301)

When the Google Search Appliance crawls a URL that has moved permanently, the Web server returns a 301 status. For example, the search appliance crawls the old address, http://myserver.com/301-source.html, and is redirected to the new address, http://myserver.com/301-destination.html. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL displays “Source page of permanent redirect” for the source URL and “Crawled: New Document” for the destination URL.

In search results, the URL of the 301 redirect appears as the URL of the destination page.

For example, if a user searches for info:http://myserver.com/301-source.html, the results display http://myserver.com/301-destination.html.

To enable search results to display a 301 redirect, ensure that start and follow URL patterns on the Content Sources > Web Crawl > Start and Block URLs page match both the source page and the destination page.
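
You can confirm both the 301 status and its destination by requesting the response headers for the source URL; the example below reuses the hypothetical URLs from this section.

# The status line shows 301 and the Location header shows the destination URL.
curl -sI http://myserver.com/301-source.html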

URL Moved Temporarily Redirect (302)

When the Google Search Appliance crawls a URL that has moved temporarily, the Web server returns a 302 status. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL shows the following value for the source page:

There is no entry for the destination page in a 302 redirect.

In search results, the URL of the 302 redirect appears as the URL of the source page.

If the redirect destination URL does not match a Follow pattern, or matches a Do Not Follow Pattern, on the Content Sources > Web Crawl > Start and Block URLs page, the document does not display in search results. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL shows the following value for the source page:

A META tag that specifies http-equiv="refresh" is handled as a 302 redirect.
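
To check whether a page uses a META refresh rather than an HTTP redirect, you can search its markup; the URL below is a placeholder.

# Show any META refresh tags in the page source.
curl -s http://myserver.example.com/page.html | grep -io '<meta[^>]*http-equiv="refresh"[^>]*>'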

Authentication Required (401) or Document Not Found (404) for SMB File Share Crawls

When the Google Search Appliance attempts to crawl content on SMB-based file systems, the Web server might return a 401 or 404 status. If this happens, take the following actions:

Ensure that the URL patterns entered on the Content Sources > Web Crawl > Start and Block URLs page are in the format smb://<host>.<domain>/<share>/<path>
Ensure that you have entered the appropriate patterns for authentication on the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.

On the file share server, ensure that the directories or files you have configured for crawling are not empty. Also, on the file share server (in the configuration panel), verify that:

Also, ensure that the file server accepts inbound TCP connections from the search appliance on ports 139 and 445. You can verify whether the ports are open by using the nmap command on a machine on the same subnet as the search appliance. Run the following command:

nmap <fileshare host> -p 139,445

The response needs to be “open” for both. If the nmap command is not available on the machine you are using, you can use the telnet command for each of the ports individually. Run the following commands:

telnet <fileshare-host> 139
telnet <fileshare-host> 445

A connection should be established rather than refused.

If the search appliance is crawling a Windows file share, verify that NTLMv2 is enabled on the Windows file share by following section 10 in Microsoft Support’s document (http://support.microsoft.com/kb/823659). Note that NTLMv1 is insecure and is not supported.

You can also use a script from the Google Search Appliance Admin Toolkit project for additional diagnostics outside the search appliance. To access the script, visit http://gsa-admin-toolkit.googlecode.com/svn/trunk/smbcrawler.py.

Cyclic Redirects

A cyclic redirect is a request for a URL in which the response is a redirect back to the same URL with a new cookie. The search appliance detects cyclic redirects and sets the appropriate cookie.

URL Rewrite Rules

In certain cases, you may notice URLs in the Admin Console that differ slightly from the URLs in your environment. The reason for this is that the Google Search Appliance automatically rewrites or rejects a URL if the URL matches certain patterns. The search appliance rewrites the URL for the following reasons:

Before rewriting a URL, the search appliance crawler attempts to match it against each of the patterns described for:

BroadVision Web Server
Sun Java System Web Server
Microsoft Commerce Server
Servers that run Java servlet containers
Lotus Domino Enterprise Server
URLs with # suffixes
Multiple versions of the same URL
ColdFusion Application Server
Index pages

If the URL matches one of the patterns, it is rewritten or rejected before it is fetched.

BroadVision Web Server

In URLs for BroadVision Web server, the Google Search Appliance removes the BV_SessionID and BV_EngineID parameters before fetching URLs.

For example, before the rewrite, this is the URL:

http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp?BV_SessionID=NNNN0974886399.1076010447NNNN&BV_EngineID=ccceadcjdhdfelgcefe4ecefedghhdfjk.0

After the rewrite, this is the URL:

http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp

Sun Java System Web Server

In URLs for Sun Java System Web Server, the Google Search Appliance removes the GXHC_qx_session_id parameter before fetching URLs.

Microsoft Commerce Server

In URLs for Microsoft Commerce Server, the Google Search Appliance removes the shopperID parameter before fetching URLs.

For example, before the rewrite, this is the URL:

http://www.shoprogers.com/homeen.asp?shopperID=PBA1XEW6H5458NRV2VGQ909

After the rewrite, this is the URL:

http://www.shoprogers.com/homeen.asp

Servers that Run Java Servlet Containers

In URLs for servers that run Java servlet containers, the Google Search Appliance removes jsessionid, $jsessionid$, and $sessionid$ parameters before fetching URLs.

Lotus Domino Enterprise Server

Lotus Domino Enterprise URL patterns are case-sensitive and are normally recognized by the presence of .nsf in the URL along with a well-known command such as “OpenDocument” or “ReadForm.” If your Lotus Domino Enterprise URL does not match any of the cases below, it does not trigger the rewrite or reject rules.

The Google Search Appliance rejects URL patterns that contain:

The search appliance rewrites:

The following sections provide details about search appliance rewrite rules for Lotus Domino Enterprise server.

OpenDocument URLs

The Google Search Appliance rewrites OpenDocument URLs to substitute a 0 for the view name. This is a method for accessing the document regardless of view, and stops the search appliance crawler from fetching multiple views of the same document.

The syntax for this type of URL is http://Host/Database/View/DocumentID?OpenDocument. The search appliance rewrites this as http://Host/Database/0/DocumentID?OpenDocument

For example, before the rewrite, this is the URL:

http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf/8d7955daacc5bdbd852567a1005ae562/c8dac6f3fef2f475852567a6005fb38f

After the rewrite, this is the URL:

http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf/0/c8dac6f3fef2f475852567a6005fb38f?OpenDocument
URLs with # Suffixes

The Google Search Appliance removes suffixes that begin with # from URLs that have no parameters.

Multiple Versions of the Same URL

The Google Search Appliance converts a URL that has multiple possible representations into one standard, or canonical URL. The search appliance does this conversion so that it does not fetch multiple versions of the same URL with differing order of parameters. The search appliance’s canonical URL has the following syntax for the parameters that follow the question mark:

To convert a URL to a canonical URL, the search appliance makes the following changes:

For example, before the rewrite, this is the URL:

http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Count=30&Expand=3

After the rewrite, this is the URL:

http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Start=1&Count=1000&ExpandView

ColdFusion Application Server

In URLs for ColdFusion application server, the Google Search Appliance removes CFID and CFTOKEN parameters before fetching URLs.

Index Pages

In URLs for index pages, the Google Search Appliance removes index.htm or index.html from the end of URLs before fetching them. It also automatically removes them from Start URLs that you enter on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.

For example, before the rewrite, this is the URL:

http://www.google.com/index.html

After the rewrite, this is the URL:

http://www.google.com/