Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter provides reference information about crawl administration tasks.
The following table lists Google Search Appliance crawl and index features. For each feature, the table lists the page in the Admin Console where you can use the feature and a reference to a section in this document that describes it.
Always force recrawl URLs
Content Sources > Web Crawl > Freshness Tuning
Freshness Tuning
Content statistics
Index > Diagnostics > Content Statistics
Using the Admin Console to Monitor a Crawl
Continuous crawl
Content Sources > Web Crawl > Crawl Schedule
Selecting a Crawl Mode
Coverage tuning
Content Sources > Web Crawl > Coverage Tuning
Coverage Tuning
Index diagnostics
Index > Diagnostics > Index Diagnostics
Crawl frequently URLs
Crawl infrequently URLs
Crawl modes
Crawl queue snapshots
Content Sources > Diagnostics > Crawl Queue
Crawl schedule
Scheduling a Crawl
Crawl status
Content Sources > Diagnostics > Crawl Status
Crawl URLs
Content Sources > Web Crawl > Start and Block URLs
Configuring a Crawl
Do not follow patterns
Document dates
Index > Document Dates
Defining Document Date Rules
Duplicate hosts
Content Sources > Web Crawl > Duplicate Hosts
Preventing Crawling of Duplicate Hosts
Entity recognition
Index > Entity Recognition
Discovering and Indexing Entities
Follow patterns
Content Sources > Web Crawl > Start and Block URLs
Freshness tuning
Host load exceptions
Content Sources > Web Crawl > Host Load Schedule
Configuring Web Server Host Load Schedules
Host load schedule
HTTP headers
Content Sources > Web Crawl > HTTP Headers
Identifying the User Agent
Index limits
Index > Index Settings
Changing the Amount of Each Document that Is Indexed
Infinite space detection
Enabling Infinite Space Detection
Maximum number of URLs to crawl
Metadata indexing
Configuring Metadata Indexing
Proxy servers
Content Sources > Web Crawl > Proxy Servers
Crawling over Proxy Servers
Recrawl URLs
Scheduled crawl
Start crawling from the following URLs
Web server host load
The following table lists Google Search Appliance crawl and index administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.
Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages
Preparing Data for a Crawl
Set up the crawl path: start URLs, follow patterns, do not follow patterns
Test URL patterns in the crawl path
Testing Your URL Patterns
Select a crawl mode: continuous crawl or scheduled crawl
Schedule a crawl
Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl
Pause or restart a continuous crawl
Stopping, Pausing, or Resuming a Crawl
Stop a scheduled crawl
Submit a URL to be recrawled
Submitting a URL to Be Recrawled
Change the amount of each document that is indexed
Configure metadata indexing
Set up entity recognition
Control the number of URLs the search appliance crawls for a site
Set up proxies for Web servers
Locate or change the user-agent name
Enter additional HTTP headers for the search appliance crawler to use
Prevent recrawling of content that resides on duplicate hosts
Prevent crawling of duplicate content to avoid infinite space indexing
Define rules for the search appliance crawler to use as it indexes documents
Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling
View the current crawl mode and summary information about events of the past 24 hours in a crawl
View crawl history for all hosts, a specific host, or a specific file
Define and view a snapshot of uncrawled URLs in the crawl queue
View summary information about files that have been crawled
View current license information
What Is the Search Appliance License Limit?
Administration > License
The following table lists Google Search Appliance Admin Console pages that are used to administer a basic crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.