Monitoring and Troubleshooting Crawls
Using the Admin Console to Monitor a Crawl
The Admin Console provides Reports pages that enable you to monitor crawling. The following list describes monitoring tasks that you can perform using these pages.
• While the Google Search Appliance is crawling, you can view summary information about events of the past 24 hours using the Content Sources > Diagnostics > Crawl Status page. You can also use this page to stop a scheduled crawl, or to pause or restart a continuous crawl (see Stopping, Pausing, or Resuming a Crawl).
• While the Google Search Appliance is crawling, you can view its history using the Index > Diagnostics > Index Diagnostics page. Index diagnostics, as well as search logs and search reports, are organized by collection (see Using Collections). When the page first appears, it shows the crawl history for the current domain: each URL that has been fetched and timestamps for the last 10 fetches. If a fetch was not successful, an error message is also listed. From the domain level, you can navigate to lower levels that show the history for a particular host, directory, or URL. At each level, the page displays information that is pertinent to the selected level. At the URL level, the page shows summary information as well as a detailed Crawl History. You can also use this page to submit a URL for recrawl (see Submitting a URL to Be Recrawled).
• At any time while the Google Search Appliance is crawling, you can define and view a snapshot of the crawl queue using the Content Sources > Diagnostics > Crawl Queue page. A crawl queue snapshot displays URLs that are waiting to be crawled as of the moment of the snapshot.
• At any time while the Google Search Appliance is crawling, you can view summary information about files that have been crawled using the Index > Diagnostics > Content Statistics page. You can also use this page to export the summary information to a comma-separated values file.
Crawl Status Messages
In the Crawl History for a specific URL on the Index > Diagnostics > Index Diagnostics page, the Crawl Status column lists various messages, as described in the following table.
Retrying URL: Host Unreachable while trying to fetch robots.txt
The Google Search Appliance could not connect to a Web server when trying to fetch robots.txt, or could not connect to a Web server because of a networking issue.
The crawler caches the results of DNS queries for a long time regardless of the TTL values specified in the DNS response. A workaround is to save and then revert a pattern change on the Content Sources > Web Crawl > Proxy Servers page. Saving changes here causes internal processes to restart and flush out the DNS cache. |
Network Connectivity Test of Start URLs Failed
Slow Crawl Rate
The Content Sources > Diagnostics > Crawl Status page in the Admin Console displays the Current Crawling Rate, which is the number of URLs being crawled per second. Slow crawling may be caused by the following factors:
• Non-HTML content
• Complex content
• Host load
• Network problems
• Slow Web servers
• Query load
• Wait times
These factors are described in the following sections.
Non-HTML Content
If the search appliance is crawling a single UNIX/Linux Web server, you can run the tail command-line utility on the server access logs to see what was recently crawled. The tail utility copies the last part of a file. You can also run the tcpdump command to create a dump of network traffic that you can use to analyze a crawl.
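As a sketch of the tail-and-filter approach, the following Python fragment isolates the crawler's recent fetches from an access log. The log path, line format, and the "gsa-crawler" user-agent string are illustrative; substitute your server's actual values:

```python
# Sketch: show the crawler's most recent fetches from a Web server access log.
# The path /tmp/access.log, the log line format, and the "gsa-crawler"
# user-agent string are assumptions for illustration.
from collections import deque

LOG = "/tmp/access.log"

# Create a small sample log so the sketch is self-contained.
with open(LOG, "w") as f:
    f.write('10.1.1.5 - - [01/Jan/2014:10:00:01] "GET /docs/a.pdf HTTP/1.1" 200 52344 "gsa-crawler"\n')
    f.write('10.1.1.9 - - [01/Jan/2014:10:00:02] "GET /site/b.html HTTP/1.1" 200 1822 "Mozilla/5.0"\n')
    f.write('10.1.1.5 - - [01/Jan/2014:10:00:03] "GET /docs/c.html HTTP/1.1" 200 977 "gsa-crawler"\n')

# Equivalent of: tail -n 100 access.log | grep gsa-crawler
with open(LOG) as f:
    recent = deque(f, maxlen=100)
for line in recent:
    if "gsa-crawler" in line:
        print(line, end="")
```

A run of mostly large non-HTML documents (PDFs, Office files) in this output is one sign that content conversion, rather than the network, is slowing the crawl.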
If the search appliance is crawling multiple Web servers, it can crawl through a proxy.
Complex Content
Crawling many complex documents can cause a slow crawl rate.
To ensure that static complex documents are not recrawled as often as dynamic documents, add the URL patterns to the Crawl Infrequently URLs on the Content Sources > Web Crawl > Freshness Tuning page (see Freshness Tuning).
Host Load
To speed up crawling, you may need to increase the value of concurrent connections to the Web server by using the Content Sources > Web Crawl > Host Load Schedule page (see Configuring Web Server Host Load Schedules).
Network Problems
Slow Web Servers
You can also log in to a Web server to determine whether there are any internal bottlenecks.
Query Load
Wait Times
Wait times can occur when your enterprise content includes:
It is not possible for an administrator to view the maximum wait time for URLs in the crawl queue or to view the number of URLs in the queue whose scheduled crawl time has passed. However, you can use the Content Sources > Diagnostics > Crawl Queue page to create a crawl queue snapshot, which shows:
Errors from Web Servers
If the Google Search Appliance receives an error when fetching a URL, it records the error in Index > Diagnostics > Index Diagnostics. By default, the search appliance takes action based on whether the error is permanent or temporary:
• Immediate Index Removal—Select this option to immediately remove the URL from the index.
• Number of Failures for Index Removal—Use this option to specify the number of times the search appliance is to retry fetching a URL.
• Successive Backoff Intervals (hours)—Use this option to specify the number of hours between backoff intervals.
To configure settings, use the options in the Configure Backoff Retries and Remove Index Information section of the Content Sources > Web Crawl > Crawl Schedule page in the Admin Console. For more information about configuring settings, click Admin Console Help > Content Sources > Web Crawl > Crawl Schedule.
The following table lists permanent and temporary Web server errors. For detailed information about HTTP status codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
You can view errors for a specific URL in the Crawl Status column on the Index > Diagnostics > Index Diagnostics page.
URL Moved Permanently Redirect (301)
When the Google Search Appliance crawls a URL that has moved permanently, the Web server returns a 301 status. For example, the search appliance crawls the old address, http://myserver.com/301-source.html, and is redirected to the new address, http://myserver.com/301-destination.html. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL displays “Source page of permanent redirect” for the source URL and “Crawled: New Document” for the destination URL.
In search results, the URL of the 301 redirect appears as the URL of the destination page.
For example, if a user searches for info:http://myserver.com/301-source.html, the results display http://myserver.com/301-destination.html.
To enable search results to display a 301 redirect, ensure that start and follow URL patterns on the Content Sources > Web Crawl > Start and Block URLs page match both the source page and the destination page.
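The redirect-following behavior described above can be reproduced outside the appliance. The following is a minimal sketch using Python's standard library and a throwaway local server; the file names mirror the example above, and everything else is illustrative:

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    """Serves /301-source.html as a permanent redirect to /301-destination.html."""
    def do_GET(self):
        if self.path == "/301-source.html":
            self.send_response(301)
            self.send_header("Location", "/301-destination.html")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"destination page")

    def log_message(self, *args):  # keep request logging quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

source = f"http://127.0.0.1:{server.server_address[1]}/301-source.html"
with urllib.request.urlopen(source) as resp:  # urllib follows the redirect
    final_url = resp.geturl()
print(final_url)  # the destination URL, as in the appliance's search results
server.shutdown()
```

As with the appliance, the client ends up at the destination URL; only the destination address survives the 301.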
URL Moved Temporarily Redirect (302)
When the Google Search Appliance crawls a URL that has moved temporarily, the Web server returns a 302 status. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL shows the following value for the source page:
There is no entry for the destination page in a 302 redirect.
In search results, the URL of the 302 redirect appears as the URL of the source page.
If the redirect destination URL does not match a Follow pattern, or matches a Do Not Follow Pattern, on the Content Sources > Web Crawl > Start and Block URLs page, the document does not display in search results. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL shows the following value for the source page:
A META tag that specifies http-equiv="refresh" is handled as a 302 redirect.
Authentication Required (401) or Document Not Found (404) for SMB File Share Crawls
• Ensure that the URL patterns entered on the Content Sources > Web Crawl > Start and Block URLs page are in the format smb://.//
• Ensure that you have entered the appropriate patterns for authentication on the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.
To verify that the file share's SMB ports are reachable from the network, run the following commands:

nmap <fileshare-host> -p 139,445
telnet <fileshare-host> 139
telnet <fileshare-host> 445

A connection should be established rather than refused.
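The same reachability check can be scripted. The following is a minimal Python sketch of the nmap/telnet checks above; "fileshare-host" is a placeholder for your actual file share server:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "fileshare-host" is a placeholder; ports 139 and 445 are the SMB ports
# the appliance needs, as in the nmap command above.
for port in (139, 445):
    state = "open" if port_open("fileshare-host", port) else "closed or unreachable"
    print(f"port {port}: {state}")
```

A "closed or unreachable" result for either port points to a firewall or SMB service problem rather than a crawler misconfiguration.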
If the search appliance is crawling a Windows file share, verify that NTLMv2 is enabled on the Windows file share by following section 10 in Microsoft Support's document (http://support.microsoft.com/kb/823659). Note that NTLMv1 is very insecure and is not supported.
Note that you can also use a script on the Google Search Appliance Admin Toolkit project page for additional diagnostics outside the search appliance. To access the script, visit http://gsa-admin-toolkit.googlecode.com/svn/trunk/smbcrawler.py.
Cyclic Redirects
URL Rewrite Rules
If a URL matches one of the rewrite patterns, it is rewritten or rejected before it is fetched.
BroadVision Web Server
For example, before the rewrite, this is the URL:
http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp?BV_SessionID=NNNN0974886399.1076010447NNNN&BV_EngineID=ccceadcjdhdfelgcefe4ecefedghhdfjk.0
After the rewrite, this is the URL:
http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp
Sun Java System Web Server
Microsoft Commerce Server
For example, before the rewrite, this is the URL:
http://www.shoprogers.com/homeen.asp?shopperID=PBA1XEW6H5458NRV2VGQ909
After the rewrite, this is the URL:
http://www.shoprogers.com/homeen.asp
Servers that Run Java Servlet Containers
Lotus Domino Enterprise Server
The Google Search Appliance rejects URL patterns that contain:
The search appliance rewrites:
OpenDocument URLs
The syntax for this type of URL is http://Host/Database/View/DocumentID?OpenDocument. The search appliance rewrites this as http://Host/Database/0/DocumentID?OpenDocument.
For example, before the rewrite, this is the URL:
http://www12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf/8d7955daacc5bdbd852567a1005ae562/c8dac6f3fef2f475852567a6005fb38f
After the rewrite, this is the URL:
http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf/0/c8dac6f3fef2f475852567a6005fb38f?OpenDocument
URLs with # Suffixes
The Google Search Appliance removes suffixes that begin with # from URLs that have no parameters.
Multiple Versions of the Same URL
To convert a URL to a canonical URL, the search appliance makes the following changes:
For example, before the rewrite, this is the URL:
http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Count=30&Expand=3
After the rewrite, this is the URL:
http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Start=1&Count=1000&ExpandView
ColdFusion Application Server
Index Pages
In URLs for index pages, the Google Search Appliance removes index.htm or index.html from the end of URLs before fetching them. It also automatically removes them from Start URLs that you enter on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
For example, before the rewrite, this is the URL:
http://www.google.com/index.html
After the rewrite, this is the URL:
http://www.google.com/
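The index-page rewrite, together with the fragment removal described earlier, can be sketched as a hypothetical helper. This is illustrative only, not the appliance's actual rewrite code:

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_url(url):
    """Sketch of two rewrites described above: drop a #fragment from a URL
    that has no parameters, and drop a trailing index.htm(l)."""
    parts = urlsplit(url)
    # Keep the fragment only when the URL carries query parameters.
    fragment = parts.fragment if parts.query else ""
    path = parts.path
    for name in ("index.html", "index.htm"):
        if path.endswith("/" + name):
            path = path[: -len(name)]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, fragment))

print(rewrite_url("http://www.google.com/index.html"))
# http://www.google.com/
```

Applied to the example above, the helper yields the same canonical form the appliance stores, which is why both spellings of an index page resolve to a single entry in the index.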