7.4 - Index

Google Search Appliance Documentation

Administering Crawl

Index

Symbols

^ character 91, 92, 94

.* character 94

.tar.gz files 37

$ character 90, 92, 94

Numerics

200 status code 22

301 status code 51

302 status code 51

304 status code 17

401 status code 51, 52

404 status code 51, 52

500 status code 51

501 status code 51

A

Administration > License page 23, 26

Administration > Network Settings page 29, 35

Administration > System Settings page 24

authentication 9, 26, 28, 47, 51

B

backreferences 96

BroadVision Web server 53

C

caret character 91, 92

case sensitivity in URL patterns 88

checksum 16, 17

ColdFusion application server 56

collections

    default collection 62

    URL patterns 62

comments in URL patterns 88

compressed files, indexing 10, 37

contains prefix 93

content

    amount indexed 59

    can be crawled 9–10

    compressed files 10

    database data source 76

    network file shares 10

    not crawled 10–11

    public web content 9

    secure web content 9

Content Sources > Databases page 38, 44, 74–85

Content Sources > Diagnostics > Crawl Queue page 46, 50

Content Sources > Diagnostics > Crawl Status page 24, 43, 46, 48

Content Sources > Feeds page 82

Content Sources > Web Crawl > Coverage Tuning page 58

Content Sources > Web Crawl > Crawl Schedule page 42, 51

Content Sources > Web Crawl > Duplicate Hosts page 61

Content Sources > Web Crawl > Freshness Tuning page 21, 22, 43, 48, 58

Content Sources > Web Crawl > Host Load Schedule page 18, 48, 61

Content Sources > Web Crawl > HTTP Headers page 57

Content Sources > Web Crawl > Proxy Servers page 60

Content Sources > Web Crawl > Secure Crawl > Crawler Access page 28, 52

Content Sources > Web Crawl > Start and Block URLs page 18, 21, 23, 26, 34, 35, 36, 37, 52, 56, 62, 83, 86–87, 96

continuous crawl 8

coverage tuning 58

crawl

    compressed files 37

    configuring 34–38

    coverage tuning 58

    crawl query, SQL 77

    databases 38, 44, 72–85

    do not crawl URLs 36

    duplicate content 61

    duplicate hosts 61

    excluded content 10

    excluding directories 33

    excluding URLs 36

    freshness tuning 58

    Google regular expressions 37

    JavaScript 63–65

    overview 13–20

    patterns 11, 19, 21, 23, 24, 25, 26, 34, 35, 36, 86–96

    preparing 28–41

    proxy servers 60

    queue 14, 15–16, 17, 19, 21

    recrawl 16, 21, 43

    scheduled 8, 42

    secure content 40

    slow rate 48–49

    SMB URLs 38–40

    status messages 47

    testing patterns 37

    URLs to crawl frequently 21

D

database crawl

    advanced settings 78, 80

    troubleshooting 82–83

    URL patterns 81

database feed 75

database stylesheet 78

database synchronization 82

databases

    crawling 38, 44, 72–85

    data source information 76

    synchronizing 74

date formats, metadata 60

document date rules 41

depth of crawl 96

matching in URLs 89

directories, excluding from crawl 33

do not crawl URLs 36

do not follow patterns 86

document dates 24

documents

    amount indexed 59

    removing from index 25

    removing from servers 26

dollar character 90, 92

Domain Name Services (DNS) 40

domains, matching 89

duplicate hosts 61

E

entity recognition 65–66

exception patterns 94

F

feeds

    database 73, 75

file shares 17, 33

matching in URLs 90

follow URLs 35, 86

freshness tuning 58

G

Google regular expressions 37, 94–95

googleoff and googleon tags 31

H

host load 48, 61

hostname resolution 40

HTML documents 18

HTTP status codes 51

I

If-Modified-Since headers 17

index 7, 8, 16, 18, 19, 21, 22, 25, 26, 31, 35, 47, 59, 62

Index > Collections page 62

Index > Diagnostics > Content Statistics page 46

Index > Diagnostics > Index Diagnostics page 19, 21, 22, 43, 46, 47, 50, 51, 62

Index > Document Dates page 41

Index > Entity Recognition page 65, 66

Index > Index Settings page 59, 60

indexing, wildcard 71

infinite space detection 61

J

Java servlet containers 54

JavaScript crawling 63

L

limit 21, 22–24, 30, 36, 75, 77

Lotus Domino Enterprise 54

M

metadata

    date formats 60

    entity recognition 65–66

    excluding names 59

    including names 59

    indexing 59–60

    multivalued separators 60

Microsoft Commerce Server 54

multivalued separators 60

N

network

    connectivity 22, 47

noarchive 30, 31

no_crawl directories 33

nofollow 30, 31

non-HTML content 48

O

OpenDocument 55

P

Pattern Tester Utility page 37

patterns, URL 12, 19, 21, 23, 24, 25, 26, 62, 81, 86–96

ports, matching in URLs 91

prefix option in URLs 91

protocols, matching in URLs 91

proxy servers 60

public web content 9

Q

queue, crawl 14–16, 17, 19, 21

R

recrawl 16, 21, 43, 50

regular expressions 88

removing

    documents from index 25–27, 62

    documents from servers 26

robots meta tag 11, 30

robots.txt file 11, 28–29

S

scheduled crawl 8

Search > Search Features > Dynamic Navigation page 59

Search > Search Features > Front Ends page 26

delay after crawl 21

secure content 40

secure web content 9

SMB URLs 38–40, 93

SQL crawl query 77

SQL serve query 77

start URLs 34, 86

status messages 47

strings, matching 93

suffix option in URLs 92

Sun Java System Web Server 54

synchronizing a database 74

T

TableCrawler 73

test network connectivity 47

text, excluding from the index 31

U

unlinked URLs 12, 33

URLs

    BroadVision Web server 53

    ColdFusion application server 56

    crawl frequently 21

    directory names 89

    domain names 89

    exception patterns 94

    generated URLs, database crawl 75

    Java servlet containers 54

    Lotus Domino Enterprise 54

    maximum number crawled 24

    Microsoft Commerce Server 54

    multiple versions 55

    OpenDocument 55

    patterns 19, 21, 23, 24, 25, 26, 34, 35, 36, 62, 81, 86–96

    prefix option 91

    priority in the crawl queue 15

    rewrite rules 53–56

    SMB 38–40, 93

    specific matches 92

    suffix option 92

    Sun Java System Web Server 54

    testing patterns 37

    unlinked 12, 33

V

valid URL patterns 87

W

wildcard indexing 71

X

X-Robots-Tag 31