7.4 - Crawl Quick Reference

Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter provides reference information about crawl administration tasks.

Crawling and Indexing Features

The following table lists Google Search Appliance crawl and index features. For each feature, the table lists the page in the Admin Console where you can use the feature and a reference to a section in this document that describes it.


Feature	Admin Console Page	Reference
Always force recrawl URLs	Content Sources > Web Crawl > Freshness Tuning	Freshness Tuning
Content statistics	Index > Diagnostics > Content Statistics	Using the Admin Console to Monitor a Crawl
Continuous crawl	Content Sources > Web Crawl > Crawl Schedule	Selecting a Crawl Mode
Coverage tuning	Content Sources > Web Crawl > Coverage Tuning	Coverage Tuning
Index diagnostics	Index > Diagnostics > Index Diagnostics	Using the Admin Console to Monitor a Crawl
Crawl frequently URLs	Content Sources > Web Crawl > Freshness Tuning	Freshness Tuning
Crawl infrequently URLs	Content Sources > Web Crawl > Freshness Tuning	Freshness Tuning
Crawl modes	Content Sources > Web Crawl > Crawl Schedule	Selecting a Crawl Mode
Crawl queue snapshots	Content Sources > Diagnostics > Crawl Queue	Using the Admin Console to Monitor a Crawl
Crawl schedule	Content Sources > Web Crawl > Crawl Schedule	Scheduling a Crawl
Crawl status	Content Sources > Diagnostics > Crawl Status	Using the Admin Console to Monitor a Crawl
Crawl URLs	Content Sources > Web Crawl > Start and Block URLs	Configuring a Crawl
Do not follow patterns	Content Sources > Web Crawl > Start and Block URLs	Configuring a Crawl
Document dates	Index > Document Dates	Defining Document Date Rules
Duplicate hosts	Content Sources > Web Crawl > Duplicate Hosts	Preventing Crawling of Duplicate Hosts
Entity recognition	Index > Entity Recognition	Discovering and Indexing Entities
Follow patterns	Content Sources > Web Crawl > Start and Block URLs	Configuring a Crawl
Freshness tuning	Content Sources > Web Crawl > Freshness Tuning	Freshness Tuning
Host load exceptions	Content Sources > Web Crawl > Host Load Schedule	Configuring Web Server Host Load Schedules
Host load schedule	Content Sources > Web Crawl > Host Load Schedule	Configuring Web Server Host Load Schedules
HTTP headers	Content Sources > Web Crawl > HTTP Headers	Identifying the User Agent
Index limits	Index > Index Settings	Changing the Amount of Each Document that Is Indexed
Infinite space detection	Content Sources > Web Crawl > Duplicate Hosts	Enabling Infinite Space Detection
Maximum number of URLs to crawl	Content Sources > Web Crawl > Host Load Schedule	Configuring Web Server Host Load Schedules
Metadata indexing	Index > Index Settings	Configuring Metadata Indexing
Proxy servers	Content Sources > Web Crawl > Proxy Servers	Crawling over Proxy Servers
Recrawl URLs	Content Sources > Web Crawl > Freshness Tuning	Freshness Tuning
Scheduled crawl	Content Sources > Web Crawl > Crawl Schedule	Selecting a Crawl Mode
Start crawling from the following URLs	Content Sources > Web Crawl > Start and Block URLs	Configuring a Crawl
Web server host load	Content Sources > Web Crawl > Host Load Schedule	Configuring Web Server Host Load Schedules

Crawling and Indexing Administration Tasks

The following table lists Google Search Appliance crawl and index administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.


Task	Reference	Admin Console Page
Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages	Preparing Data for a Crawl
Set up the crawl path: start URLs, follow patterns, do not follow patterns	Configuring a Crawl	Content Sources > Web Crawl > Start and Block URLs
Test URL patterns in the crawl path	Testing Your URL Patterns	Content Sources > Web Crawl > Start and Block URLs
Select a crawl mode: continuous crawl or scheduled crawl	Selecting a Crawl Mode	Content Sources > Web Crawl > Crawl Schedule
Schedule a crawl	Scheduling a Crawl	Content Sources > Web Crawl > Crawl Schedule
Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl	Freshness Tuning	Content Sources > Web Crawl > Freshness Tuning
Pause or restart a continuous crawl	Stopping, Pausing, or Resuming a Crawl	Content Sources > Diagnostics > Crawl Status
Stop a scheduled crawl	Stopping, Pausing, or Resuming a Crawl	Content Sources > Diagnostics > Crawl Status
Submit a URL to be recrawled	Freshness Tuning	Content Sources > Web Crawl > Freshness Tuning
Submit a URL to be recrawled	Submitting a URL to Be Recrawled	Index > Diagnostics > Index Diagnostics
Change the amount of each document that is indexed	Changing the Amount of Each Document that Is Indexed	Index > Index Settings
Configure metadata indexing	Configuring Metadata Indexing	Index > Index Settings
Set up entity recognition	Discovering and Indexing Entities	Index > Entity Recognition
Control the number of URLs the search appliance crawls for a site	Coverage Tuning	Content Sources > Web Crawl > Coverage Tuning
Set up proxies for Web servers	Crawling over Proxy Servers	Content Sources > Web Crawl > Proxy Servers
Locate or change the user-agent name	Identifying the User Agent	Content Sources > Web Crawl > HTTP Headers
Enter additional HTTP headers for the search appliance crawler to use	Identifying the User Agent	Content Sources > Web Crawl > HTTP Headers
Prevent recrawling of content that resides on duplicate hosts	Preventing Crawling of Duplicate Hosts	Content Sources > Web Crawl > Duplicate Hosts
Prevent crawling of duplicate content to avoid infinite space indexing	Enabling Infinite Space Detection	Content Sources > Web Crawl > Duplicate Hosts
Define rules for the search appliance crawler to use to determine document dates as it indexes documents	Defining Document Date Rules	Index > Document Dates
Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling	Configuring Web Server Host Load Schedules	Content Sources > Web Crawl > Host Load Schedule
View the current crawl mode and summary information about events of the past 24 hours in a crawl	Using the Admin Console to Monitor a Crawl	Content Sources > Diagnostics > Crawl Status
View crawl history for all hosts, a specific host, or a specific file	Using the Admin Console to Monitor a Crawl	Index > Diagnostics > Index Diagnostics
Define and view a snapshot of uncrawled URLs in the crawl queue	Using the Admin Console to Monitor a Crawl	Content Sources > Diagnostics > Crawl Queue
View summary information about files that have been crawled	Using the Admin Console to Monitor a Crawl	Index > Diagnostics > Content Statistics
View current license information	What Is the Search Appliance License Limit?	Administration > License

Admin Console Basic Crawl Pages

The following table lists Google Search Appliance Admin Console pages that are used to administer a basic crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.


Admin Console Page	Reference
Content Sources > Web Crawl > Start and Block URLs	Configuring a Crawl
Content Sources > Web Crawl > Crawl Schedule	Selecting a Crawl Mode; Scheduling a Crawl
Content Sources > Web Crawl > Proxy Servers	Crawling over Proxy Servers
Content Sources > Web Crawl > HTTP Headers	Identifying the User Agent
Content Sources > Web Crawl > Duplicate Hosts	Preventing Crawling of Duplicate Hosts
Index > Document Dates	Defining Document Date Rules
Content Sources > Web Crawl > Host Load Schedule	Configuring Web Server Host Load Schedules
Content Sources > Web Crawl > Coverage Tuning	Coverage Tuning
Content Sources > Web Crawl > Freshness Tuning	Freshness Tuning
Index > Index Settings	Changing the Amount of Each Document that Is Indexed
Index > Index Settings	Configuring Metadata Indexing
Index > Entity Recognition	Discovering and Indexing Entities
Content Sources > Diagnostics > Crawl Status	Using the Admin Console to Monitor a Crawl
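Index > Diagnostics > Index Diagnostics	Using the Admin Console to Monitor a Crawl
Content Sources > Diagnostics > Crawl Queue	Using the Admin Console to Monitor a Crawl
Index > Diagnostics > Content Statistics	Using the Admin Console to Monitor a Crawl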
