Google Search Appliance Documentation
Administering Crawl
Index
Symbols
^ character, 91, 92, 94
. character, 94
.* character, 94
.tar files, 37
.tar.gz files, 37
.tgz files, 37
.zip files, 37
#suffixes, 55
$ character, 90, 92, 94
Numerics
200 status code, 22
301 status code, 51
302 status code, 51
304 status code, 17
401 status code, 51, 52
404 status code, 51, 52
500 status code, 51
501 status code, 51
A
Administration > License page, 23, 26
Administration > Network Settings page, 29, 35
Administration > System Settings page, 24
area tag, 12
authentication, 9, 26, 28, 47, 51
B
backreferences, 96
BroadVision Web server, 53
C
caret character, 91, 92
case sensitivity in URL patterns, 88
checksum, 16, 17
ColdFusion application server, 56
collections
  default collection, 62
  description, 62
  URL patterns, 62
comments in URL patterns, 88
compressed files, indexing, 10, 37
contains prefix, 93
content
  amount indexed, 59
  can be crawled, 9–10
  checksum, 17
  complex, 48
  compressed files, 10
  database data source, 76
  databases, 10
  network file shares, 10
  non-HTML, 48
  not crawled, 10–11
  public web content, 9
  secure web content, 9
Content Sources > Databases page, 38, 44, 74–85
Content Sources > Diagnostics > Crawl Queue page, 46, 50
Content Sources > Diagnostics > Crawl Status page, 24, 43, 46, 48
Content Sources > Feeds page, 82
Content Sources > Web Crawl > Coverage Tuning page, 58
Content Sources > Web Crawl > Crawl Schedule page, 42, 51
Content Sources > Web Crawl > Duplicate Hosts page, 61
Content Sources > Web Crawl > Freshness Tuning page, 21, 22, 43, 48, 58
Content Sources > Web Crawl > Host Load Schedule page, 18, 48, 61
Content Sources > Web Crawl > HTTP Headers page, 57
Content Sources > Web Crawl > Proxy Servers page, 60
Content Sources > Web Crawl > Secure Crawl > Crawler Access page, 28, 52
Content Sources > Web Crawl > Start and Block URLs page, 18, 21, 23, 26, 34, 35, 36, 37, 52, 56, 62, 83, 86–87, 96
continuous crawl, 8
coverage tuning, 58
crawl
  compressed files, 37
  configuring, 34–38
  continuous, 8
  coverage tuning, 58
  crawl query, SQL, 77
  databases, 38, 44, 72–85
  depth, 96
  description, 7
  do not crawl URLs, 36
  duplicate content, 61
  duplicate hosts, 61
  end of, 21
  excluded content, 10
  excluding directories, 33
  excluding URLs, 36
  file shares, 17
  follow URLs, 35
  freshness tuning, 58
  Google regular expressions, 37
  JavaScript, 63–65
  modes, 8, 42
  monitoring, 45
  new, 16
  overview, 13–20
  path, 12
  patterns, 11, 19, 21, 23, 24, 25, 26, 34, 35, 36, 86–96
  pausing, 43
  preparing, 28–41
  prohibiting, 11
  proxy servers, 60
  queue, 14, 15–16, 17, 19, 21
  recrawl, 16, 21, 43
  resuming, 43
  scheduled, 8, 42
  secure content, 40
  slow rate, 48–49
  SMB URLs, 38–40
  start URLs, 34
  status messages, 47
  stopping, 43
  testing patterns, 37
  URLs to crawl frequently, 21
D
database crawl
  advanced settings, 78, 80
  settings, 76
  troubleshooting, 82–83
  URL patterns, 81
database feed, 75
database stylesheet, 78
database synchronization, 82
databases, 10
  crawling, 38, 44, 72–85
  data source information, 76
  supported, 73
  synchronizing, 74
date formats, metadata, 60
dates
  configuring, 40
  document date rules, 41
DB2, 73
depth of crawl, 96
directories
  matching in URLs, 89
directories, excluding from crawl, 33
do not crawl URLs, 36
do not follow patterns, 86
document dates, 24
documents
  amount indexed, 59
  removing from index, 25
  removing from servers, 26
dollar character, 90, 92
Domain Name Services (DNS), 40
domains, matching, 89
duplicate hosts, 61
E
entity recognition, 65–66
exception patterns, 94
F
feeds
  database, 73, 75
  monitoring, 82
  stylesheet, 78
  web, 22
  XML, 78
file shares, 17, 33
files
  matching in URLs, 90
  size, 18
  type, 18
follow URLs, 35, 86
freshness tuning, 58
G
Google regular expressions, 37, 94–95
googleoff and googleon tags, 31
gsa-crawler, 57
H
host load, 48, 61
hostname resolution, 40
HTML documents, 18
HTTP status codes, 51
I
If-Modified-Since headers, 17
index, 7, 8, 16, 18, 19, 21, 22, 25, 26, 31, 35, 47, 59, 62
Index > Collections page, 62
Index > Diagnostics > Content Statistics page, 46
Index > Diagnostics > Index Diagnostics page, 19, 21, 22, 43, 46, 47, 50, 51, 62
Index > Document Dates page, 41
Index > Entity Recognition page, 65, 66
Index > Index Settings page, 59, 60
index pages, 56
indexing, wildcard, 71
infinite space detection, 61
J
Java servlet containers, 54
JavaScript crawling, 63
L
license
  expiration, 24
  limit, 21, 22–24, 30, 36, 75, 77
links, 19
Lotus Domino Enterprise, 54
M
metadata
  date formats, 60
  entity recognition, 65–66
  excluding names, 59
  including names, 59
  indexing, 59–60
  multivalued separators, 60
Microsoft Commerce Server, 54
multivalued separators, 60
MySQL, 73
N
network
  connectivity, 22, 47
  file shares, 10
  problems, 49
new crawl, 16
noarchive, 30, 31
no_crawl directories, 33
nofollow, 30, 31
noindex, 30, 31
non-HTML content, 48
O
OpenDocument, 55
Oracle, 73
P
Pattern Tester Utility page, 37
patterns, URL, 12, 19, 21, 23, 24, 25, 26, 62, 81, 86–96
ports, matching in URLs, 91
prefix option in URLs, 91
protocols, matching in URLs, 91
proxy servers, 60
public web content, 9
Q
query load, 49
queue, crawl, 14–16, 17, 19, 21
R
recrawl, 16, 21, 43, 50
redirects
  cyclic, 53
  logical, 63
regular expressions, 88
removing
  documents from index, 25–27, 62
  documents from servers, 26
robots meta tag, 11, 30
robots.txt file, 11, 28–29
S
scheduled crawl, 8
Search > Search Features > Dynamic Navigation page, 59
Search > Search Features > Front Ends page, 26
search results
  database, 76
  delay after crawl, 21
secure content, 40
secure web content, 9
SMB URLs, 38–40, 93
SQL crawl query, 77
SQL serve query, 77
SQL Server, 73
start URLs, 34, 86
status messages, 47
strings, matching, 93
suffix option in URLs, 92
Sun Java System Web Server, 54
Sybase, 73
synchronizing a database, 74
T
TableCrawler, 73
TableServer, 73
test network connectivity, 47
text, excluding from the index, 31
U
unlinked URLs, 12, 33
URLs
  #suffixes, 55
  BroadVision Web server, 53
  ColdFusion application server, 56
  crawl frequently, 21
  directory names, 89
  domain names, 89
  exception patterns, 94
  fetching, 16
  file names, 90
  following, 19
  generated URLs, database crawl, 75
  index pages, 56
  Java servlet containers, 54
  Lotus Domino Enterprise, 54
  maximum number crawled, 24
  Microsoft Commerce Server, 54
  multiple versions, 55
  OpenDocument, 55
  patterns, 19, 21, 23, 24, 25, 26, 34, 35, 36, 62, 81, 86–96
  ports, 91
  prefix option, 91
  priority in the crawl queue, 15
  protocols, 91
  rewrite rules, 53–56
  SMB, 38–40, 93
  specific matches, 92
  start URLs, 34
  suffix option, 92
  Sun Java System Web Server, 54
  testing patterns, 37
  unlinked, 12, 33
  user agent, 57
V
valid URL patterns, 87
W
wait times, 50
web servers, 49
  errors, 50
wildcard indexing, 71
X
X-Robots-Tag, 31