Preparing for a Crawl
Preparing Data for a Crawl
These tasks may involve the search appliance administrator, webmaster, content owner, and/or content server administrator.
Using robots.txt to Control Access to a Content Server
The Google Search Appliance always obeys the rules in robots.txt (see Content Prohibited by a robots.txt File), and it is not possible to override this behavior. However, a robots.txt file is not mandatory. When a robots.txt file is present, it is located in the web server's root directory, and it must be public for the search appliance to be able to access it.
Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content (see Identifying the User Agent).
If any hosts require authentication before serving robots.txt, you must configure authentication credentials using the Content Sources > Web Crawl > Secure Crawl > Crawler Access page in the Admin Console.
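To verify that a server's robots.txt file is publicly accessible, you can fetch it yourself before the crawl. The following is a minimal check from a command line, assuming curl is available and using a placeholder host name:
curl -A "gsa-crawler" http://www.example.com/robots.txt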
A robots.txt file identifies a crawler by its User-agent value and includes one or more Disallow: or Allow: directives (see Using the Allow Directive), which tell the crawler which content to ignore. The following example shows a robots.txt file:
User-agent: gsa-crawler
Disallow: /personal_records/
User-agent: gsa-crawler identifies the Google Search Appliance crawler. Disallow: tells the crawler not to crawl and index content in the /personal_records/ path.
To prevent the search appliance crawler from crawling or indexing any content on a site, use Disallow: with the value /, as shown in the following example:
User-agent: gsa-crawler
Disallow: /
To allow the search appliance crawler to crawl and index all of the content in a site, use Disallow: without a value, as shown in the following example:
User-agent: gsa-crawler
Disallow:
Using the Allow Directive
In Google Search Appliance software versions 4.6.4.G.44 and later, the search appliance user agent (gsa-crawler, see Identifying the User Agent) obeys an extension to the robots.txt standard called “Allow.” This extension may not be recognized by all other search engine crawlers, so check with any other search engines you are interested in. The Allow: directive works exactly like the Disallow: directive. Simply list a directory or page that you want to allow.
You may want to use Disallow: and Allow: together. For example, to block access to all pages in a subdirectory except one, use the following entries:
User-Agent: gsa-crawler
Disallow: /folder1/
Allow: /folder1/myfile.html
This blocks all pages inside the folder1 directory except for myfile.html.
Caching robots.txt
To clear the robots.txt file from cache and refresh it:
1. Choose Administration > Network Settings.
2. Change the DNS Servers settings.
3. Click Update Settings and Perform Diagnostics.
4. Restore the original DNS Servers settings.
5. Click Update Settings and Perform Diagnostics.
Using Robots meta Tags to Control Access to a Web Page
You can use robots meta tags with the noindex, nofollow, and noarchive keywords to control how the search appliance indexes a web page and follows its links. The none keyword is equivalent to <meta name="robots" content="noindex, nofollow, noarchive"/>.
You can combine any or all of the keywords in a single meta tag, for example:
<meta name="robots" content="noarchive, nofollow"/>
To prevent the crawler from following an individual link, add rel="nofollow" to the link, as shown in the following example:
<a href="test1.html" rel="nofollow">no follow</a>
Currently, it is not possible to set name="gsa-crawler" in a robots meta tag to limit these restrictions to the search appliance.
Using X-Robots-Tag to Control Access to Non-HTML Documents
While the robots meta tag gives you control over HTML pages, the X-Robots-Tag directive in an HTTP header response gives you control of other types of documents, such as PDF files.
For example, the following HTTP response with an X-Robots-Tag instructs the crawler not to index a page:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
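How the X-Robots-Tag header is added depends on your web server; the search appliance only acts on the header it sees in the response. As a sketch only, assuming an Apache HTTP Server with mod_headers enabled (neither of which this guide specifies), the header could be attached to every PDF file like this:
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>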
The Google Search Appliance supports a set of X-Robots-Tag directives. For example, the noindex directive means: do not show this page in search results and do not show a “Cached” link in search results.
Excluding Unwanted Text from the Index
For details about each googleon/googleoff flag, refer to the following table.
There must be a space or newline before the googleon tag.
If URL1 appears on page URL2 within googleoff and googleon tags, the search appliance still extracts the URL and adds it to the link structure. For example, the query link:URL2 still contains URL1 in the result set, but depending on which googleoff option you use, you do not see URL1 when viewing the cached version, searching on the anchor text, and so on. If you want the search appliance not to follow the links and to ignore the link structure, follow the instructions in Using Robots meta Tags to Control Access to a Web Page.
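For example, the following HTML sketch (the page content is hypothetical) uses the index flag. Text between the googleoff: index and googleon: index comments is excluded from the index for this page, while the surrounding text is indexed normally:
<p>Product description that should be searchable.</p>
<!--googleoff: index-->
<p>Legal boilerplate that should not be indexed.</p>
<!--googleon: index-->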
Using no-crawl Directories to Control Access to Files and Subdirectories
Preparing Shared Folders in File Systems
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. For more information, see Deprecation Notices. |
Ensuring that Unlinked URLs Are Crawled
Configuring a Crawl
Before starting a crawl, you must configure the crawl path so that it includes only information that your organization wants to make available in search results. To configure the crawl, use the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console to enter URLs and URL patterns in the following boxes: Start URLs, Follow Patterns, and Do Not Follow Patterns.
Note: URLs are case-sensitive.
• Make sure all patterns in the Follow Patterns field specify your domain name (for example, yourcompany.com).
Some content servers return a malformed HTTP response header with empty content, which causes the crawler to report the following error:
Error: Malformed HTTP header: empty content.
To crawl documents when this happens, you can add a header on the Content Sources > Web Crawl > HTTP Headers page of the Admin Console. In the Additional HTTP Headers for Crawler field, add:
Accept-Encoding: identity
For complete information about the Content Sources > Web Crawl > Start and Block URLs page, click Admin Console Help > Content Sources > Web Crawl > Start and Block URLs in the Admin Console.
Start URLs
Start URLs must be fully qualified URLs in the following format:
<protocol>://<host>{:port}/{path}
The information in the curly brackets is optional. The forward slash “/” after <host>{:port} is required.
Typically, start URLs include your company’s home site, as shown in the following example:
http://mycompany.com/
The following example shows a valid start URL:
http://www.example.com/help/
Examples of invalid start URLs include URLs in which the hostname is not fully qualified (a fully qualified hostname includes the local hostname and the full domain name, for example, mail.corp.company.com) and URLs that omit the required “/” after <host>{:port}.
The search appliance attempts to resolve incomplete URL information by using the values entered on the Administration > Network Settings page in the DNS Suffix (DNS Search Path) section. However, if a URL cannot be successfully resolved, the following error message is displayed in red on the page:
You have entered one or more invalid start URLs. Please check your edits.
The crawler will retry several times to crawl URLs that are temporarily unreachable.
These URLs are only the starting point(s) for the crawl. They tell the crawler where to begin crawling. However, links from the start URLs will be followed and indexed only if they match a pattern in Follow Patterns. For example, if you specify a starting URL of http://mycompany.com/ in this section and a pattern www.mycompany.com/ in the Follow Patterns section, the crawler will discover links in the http://www.mycompany.com/ web page, but will only crawl and index URLs that match the pattern www.mycompany.com/.
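As a minimal sketch of that example (the host names are placeholders), the corresponding entries on the Content Sources > Web Crawl > Start and Block URLs page would be:
Start URLs:
http://mycompany.com/
Follow Patterns:
www.mycompany.com/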
Enter start URLs in the Start URLs section on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. To crawl content from multiple websites, add start URLs for them.
Follow Patterns
The following example shows a follow and crawl URL pattern:
http://www.example.com/help/
This pattern matches URLs that are in the specified directory or in one of its subdirectories, for example:
http://www.example.com/help/two.html
http://www.example.com/help/three.html
It does not match URLs outside that directory, such as:
http://www.example.com/us/three.html
For more information about writing URL patterns, see Constructing URL Patterns.
Enter follow and crawl URL patterns in the Follow Patterns section on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
Do Not Follow Patterns
Enter do not crawl URL patterns in the Do Not Follow Patterns section on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
A leading “#” comments out a pattern. For example, the pattern .xls$ prevents the crawler from following URLs that end in .xls, while #.xls$ has no effect:
#.xls$
.xls$
Crawling and Indexing Compressed Files
By default, patterns in the Do Not Follow Patterns section exclude several types of compressed files from the crawl. To enable the search appliance to crawl these types of compressed files, go to the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console and put a "#" in front of each of the corresponding compressed-file patterns.
Testing Your URL Patterns
To confirm that URLs can be crawled, you can use the Pattern Tester Utility page. This page shows which URLs are matched by the patterns you have entered for start URLs, follow patterns, and do not follow patterns.
To use the Pattern Tester Utility page, click Test these patterns on the Content Sources > Web Crawl > Start and Block URLs page. For complete information about the Pattern Tester Utility page, click Admin Console Help > Content Sources > Web Crawl > Start and Block URLs in the Admin Console.
Using Google Regular Expressions as Crawl Patterns
• The asterisk (*) is a valid wildcard character in both GNU regular expressions and the Robots Exclusion Protocol, and can be used in the Admin Console or in robots.txt.
• The $ and ^ characters indicate the end or beginning of a string, respectively, in GNU regular expressions, and can be used in the Admin Console. They are not valid delimiters for a string in the Robots Exclusion Protocol, however, and cannot be used as anchors in robots.txt (see the example after this list).
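For example (the patterns are illustrative only), a pattern in the Admin Console can use the $ anchor, while a similar rule in robots.txt must rely on the * wildcard instead:
Admin Console Do Not Follow pattern: .pdf$
robots.txt:
User-agent: gsa-crawler
Disallow: /*.pdf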
Configuring Database Crawl
In GSA release 7.4, the on-board database crawler is deprecated. For more information, see Deprecation Notices. |
To configure a database crawl, provide database data source information by using the Create New Database Source section on the Content Sources > Databases page in the Admin Console.
For information about configuring a database crawl, refer to Providing Database Data Source Information.
About SMB URLs
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. For more information, see Deprecation Notices. |
Use the following format for an SMB URL:
smb://<host>/<share>/<path>
The following example shows a valid SMB URL for crawl:
smb://fileserver.mycompany.com/myshare/mydir/mydoc.txt
smb — Indicates the network protocol that is used to access the object.
<host> — Specifies the DNS host name.
<share> — Specifies the name of the shared folder. For example, the Microsoft Windows style share C:\myshare\ corresponds to the SMB URL smb://myhost.mycompany.com/myshare/.
<path> — Specifies the path to the document, relative to the root share. For example, if myshare on myhost.mycompany.com shares all the documents under the C:\myshare directory, the file C:\myshare\mydir\mydoc.txt is retrieved by smb://myhost.mycompany.com/myshare/mydir/mydoc.txt.
In addition, ensure that the file server accepts inbound TCP connections on ports 139 and 445. Port 139 is used to send NetBIOS requests for SMB crawling and port 445 is used to send Microsoft CIFS requests for SMB crawling. These ports on the file server must be accessible to the search appliance. For information about checking the accessibility of these ports on the file server, see Authentication Required (401) or Document Not Found (404) for SMB File Share Crawls.
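As a quick sanity check, assuming a Unix-like administrative workstation with the nc utility and using a placeholder file server name (this only approximates the appliance's own network path), you can test whether the ports are reachable:
nc -zv fileserver.mycompany.com 139
nc -zv fileserver.mycompany.com 445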
Unsupported SMB URLs
Some SMB file share implementations allow URL schemes beyond the format described above; the file system crawler does not support such URL schemes.
SMB URLs for Non-file Objects
For example, an SMB URL that refers to a share rather than to an individual file resolves to the files and subdirectories contained within the share’s top-level directory.
Hostname Resolution
Setting Up the Crawler’s Access to Secure Content
The information in this document describes crawling public content. For information about setting up the crawler’s access to secure content, see the Overview in Managing Search for Controlled-Access Content.
Configuring Searchable Dates
For dates to be properly indexed and searchable by date range, they must be in ISO 8601 format:
YYYY-MM-DD
The following example shows a date in ISO 8601 format:
2007-07-11
The following meta tag contains a date that can be indexed:
<meta name="date" content="2007-07-11">
The date in the following meta tag cannot be indexed because there is additional content:
<meta name="date" content="2007-07-11 is a date">
Defining Document Date Rules
Documents can have dates explicitly stated in more than one place, for example, in a meta tag.
To define a rule that the search appliance crawler should use to locate document dates (see How Are Document Dates Handled?) in documents for a particular URL, use the Index > Document Dates page in the Admin Console. If you define more than one document date rule for a URL, the search appliance finds all the matching dates in a document and uses the date located by the first matching rule (from top to bottom) as its document date.
To define a document date rule:
1. Choose Index > Document Dates.
2. In the Host or URL Pattern box, enter the host or URL pattern for which you want to set the rule.
3. Use the Locate Date In drop-down list to select the location of the date in documents that match the specified URL pattern.
4. If you select Meta Tag, specify the name of the tag in the Meta Tag Name box. Make sure the meta tag actually appears in your HTML. For example, for the tag <meta name="publication_date">, enter “publication_date” in the Meta Tag Name box.
5. To add another date rule, click Add More Lines, and add the rule.
6. Click Save. This triggers the Document Dates process to run.
For complete information about the Document Dates page, click Admin Console Help > Index > Document Dates in the Admin Console.