Preparing for a Crawl
Preparing Data for a Crawl
These tasks may involve the search appliance administrator, webmaster, content owner, and/or content server administrator.
Using robots.txt to Control Access to a Content Server
The Google Search Appliance always obeys the rules in robots.txt (see Content Prohibited by a robots.txt File), and it is not possible to override this behavior. However, a robots.txt file is not mandatory. When a robots.txt file is present, it is located in the web server's root directory, and it must be public for the search appliance to be able to access it.
Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content (see Identifying the User Agent).
If any hosts require authentication before serving robots.txt, you must configure authentication credentials using the Content Sources > Web Crawl > Secure Crawl > Crawler Access page in the Admin Console.
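To verify that a server's robots.txt file is publicly accessible, you can fetch it yourself before the crawl. The following is a minimal check from a command line, assuming curl is available and using a placeholder host name:
curl -A "gsa-crawler" http://www.example.com/robots.txt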
A robots.txt file identifies a crawler by its User-agent value and includes one or more Disallow: or Allow: directives (see Using the Allow Directive), which tell the crawler which content to ignore. The following example shows a robots.txt file:
User-agent: gsa-crawler
Disallow: /personal_records/
User-agent: gsa-crawler identifies the Google Search Appliance crawler. Disallow: tells the crawler not to crawl and index content in the /personal_records/ path.
To prevent the search appliance crawler from crawling or indexing any content on a site, use Disallow: with the value /, as shown in the following example:
User-agent: gsa-crawler
Disallow: /
To allow the search appliance crawler to crawl and index all of the content in a site, use Disallow: without a value, as shown in the following example:
User-agent: gsa-crawler
Disallow:
Using the Allow Directive
In Google Search Appliance software versions 4.6.4.G.44 and later, the search appliance user agent (gsa-crawler, see Identifying the User Agent) obeys an extension to the robots.txt standard called “Allow.” This extension may not be recognized by all other search engine crawlers, so check with any other search engines you are interested in. The Allow: directive works exactly like the Disallow: directive. Simply list a directory or page that you want to allow.
You may want to use Disallow: and Allow: together. For example, to block access to all pages in a subdirectory except one, use the following entries:
User-Agent: gsa-crawler
Disallow: /folder1/
Allow: /folder1/myfile.html
This blocks all pages inside the folder1 directory except for myfile.html.
Caching robots.txt
To clear the robots.txt file from cache and refresh it:
1. Choose Administration > Network Settings.
2. Change the DNS Servers settings.
3. Click Update Settings and Perform Diagnostics.
4. Restore the original DNS Servers settings.
5. Click Update Settings and Perform Diagnostics.
Using Robots meta Tags to Control Access to a Web Page
You can use robots meta tags with the noindex, nofollow, and noarchive keywords to control how the search appliance indexes a web page and follows its links. The none keyword is equivalent to <meta name="robots" content="noindex, nofollow, noarchive"/>.
You can combine any or all of the keywords in a single meta tag, for example:
<meta name="robots" content="noarchive, nofollow"/>
To prevent the crawler from following an individual link, add rel="nofollow" to the link, as shown in the following example:
<a href="test1.html" rel="nofollow">no follow</a>
Currently, it is not possible to set name="gsa-crawler" in a robots meta tag to limit these restrictions to the search appliance.
Using X-Robots-Tag to Control Access to Non-HTML Documents
While the robots meta tag gives you control over HTML pages, the X-Robots-Tag directive in an HTTP header response gives you control of other types of documents, such as PDF files.
For example, the following HTTP response with an X-Robots-Tag instructs the crawler not to index a page:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
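How the X-Robots-Tag header is added depends on your web server; the search appliance only acts on the header it sees in the response. As a sketch only, assuming an Apache HTTP Server with mod_headers enabled (neither of which this guide specifies), the header could be attached to every PDF file like this:
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>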
The Google Search Appliance supports a set of X-Robots-Tag directives. For example, the noindex directive means: do not show this page in search results and do not show a “Cached” link in search results.
Excluding Unwanted Text from the Index
For details about each googleon/googleoff flag, refer to the following table.
There must be a space or newline before the googleon tag.
If URL1 appears on page URL2 within googleoff and googleon tags, the search appliance still extracts the URL and adds it to the link structure. For example, the query link:URL2 still contains URL1 in the result set, but depending on which googleoff option you use, you do not see URL1 when viewing the cached version, searching on the anchor text, and so on. If you want the search appliance not to follow the links and to ignore the link structure, follow the instructions in Using Robots meta Tags to Control Access to a Web Page.
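For example, the following HTML sketch (the page content is hypothetical) uses the index flag. Text between the googleoff: index and googleon: index comments is excluded from the index for this page, while the surrounding text is indexed normally:
<p>Product description that should be searchable.</p>
<!--googleoff: index-->
<p>Legal boilerplate that should not be indexed.</p>
<!--googleon: index-->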
Using no-crawl Directories to Control Access to Files and Subdirectories
Preparing Shared Folders in File Systems
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. For more information, see Deprecation Notices. |
Ensuring that Unlinked URLs Are Crawled
Configuring a Crawl
Before starting a crawl, you must configure the crawl path so that it includes only information that your organization wants to make available in search results. To configure the crawl, use the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console to enter URLs and URL patterns in the following boxes: Start URLs, Follow Patterns, and Do Not Follow Patterns.
Note: URLs are case-sensitive.
• Make sure all patterns in the Follow Patterns field specify your domain name (for example, yourcompany.com).
Some content servers return a malformed HTTP response header with empty content, which causes the crawler to report the following error:
Error: Malformed HTTP header: empty content.
To crawl documents when this happens, you can add a header on the Content Sources > Web Crawl > HTTP Headers page of the Admin Console. In the Additional HTTP Headers for Crawler field, add:
Accept-Encoding: identity
For complete information about the Content Sources > Web Crawl > Start and Block URLs page, click Admin Console Help > Content Sources > Web Crawl > Start and Block URLs in the Admin Console.
Start URLs
Start URLs must be fully qualified URLs in the following format:
<protocol>://<host>{:port}/{path}
The information in the curly brackets is optional. The forward slash “/” after <host>{:port} is required.
Typically, start URLs include your company’s home site, as shown in the following example:
http://mycompany.com/
The following example shows a valid start URL:
http://www.example.com/help/
Examples of invalid start URLs include URLs in which the hostname is not fully qualified (a fully qualified hostname includes the local hostname and the full domain name, for example, mail.corp.company.com) and URLs that omit the required “/” after <host>{:port}.
The search appliance attempts to resolve incomplete URL information by using the values entered on the Administration > Network Settings page in the DNS Suffix (DNS Search Path) section. However, if a URL cannot be successfully resolved, the following error message is displayed in red on the page:
You have entered one or more invalid start URLs. Please check your edits.
The crawler will retry several times to crawl URLs that are temporarily unreachable.
These URLs are only the starting point(s) for the crawl. They tell the crawler where to begin crawling. However, links from the start URLs will be followed and indexed only if they match a pattern in Follow Patterns. For example, if you specify a starting URL of http://mycompany.com/ in this section and a pattern www.mycompany.com/ in the Follow Patterns section, the crawler will discover links in the http://www.mycompany.com/ web page, but will only crawl and index URLs that match the pattern www.mycompany.com/.
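As a minimal sketch of that example (the host names are placeholders), the corresponding entries on the Content Sources > Web Crawl > Start and Block URLs page would be:
Start URLs:
http://mycompany.com/
Follow Patterns:
www.mycompany.com/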
Enter start URLs in the Start URLs section on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. To crawl content from multiple websites, add start URLs for them.
Follow Patterns
The following example shows a follow and crawl URL pattern:
http://www.example.com/help/
This pattern matches URLs that are in the specified directory or in one of its subdirectories, for example:
http://www.example.com/help/two.html
http://www.example.com/help/three.html
It does not match URLs outside that directory, such as:
http://www.example.com/us/three.html
For more information about writing URL patterns, see Constructing URL Patterns.
Enter follow and crawl URL patterns in the Follow Patterns section on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
Do Not Follow Patterns
Enter do not crawl URL patterns in the Do Not Follow Patterns section on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
A leading “#” comments out a pattern. For example, the pattern .xls$ prevents the crawler from following URLs that end in .xls, while #.xls$ has no effect:
#.xls$
.xls$
Crawling and Indexing Compressed Files
By default, patterns in the Do Not Follow Patterns section exclude several types of compressed files from the crawl. To enable the search appliance to crawl these types of compressed files, go to the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console and put a "#" in front of each of the corresponding compressed-file patterns.
Testing Your URL Patterns
To confirm that URLs can be crawled, you can use the Pattern Tester Utility page. This page shows which URLs are matched by the patterns you have entered for start URLs, follow patterns, and do not follow patterns.
To use the Pattern Tester Utility page, click Test these patterns on the Content Sources > Web Crawl > Start and Block URLs page. For complete information about the Pattern Tester Utility page, click Admin Console Help > Content Sources > Web Crawl > Start and Block URLs in the Admin Console.
Using Google Regular Expressions as Crawl Patterns
• The asterisk (*) is a valid wildcard character in both GNU regular expressions and the Robots Exclusion Protocol, and can be used in the Admin Console or in robots.txt.
• The $ and ^ characters indicate the end or beginning of a string, respectively, in GNU regular expressions, and can be used in the Admin Console. They are not valid delimiters for a string in the Robots Exclusion Protocol, however, and cannot be used as anchors in robots.txt (see the example after this list).
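For example (the patterns are illustrative only), a pattern in the Admin Console can use the $ anchor, while a similar rule in robots.txt must rely on the * wildcard instead:
Admin Console Do Not Follow pattern: .pdf$
robots.txt:
User-agent: gsa-crawler
Disallow: /*.pdf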
Configuring Database Crawl
In GSA release 7.4, the on-board database crawler is deprecated. For more information, see Deprecation Notices. |
To configure a database crawl, provide database data source information by using the Create New Database Source section on the Content Sources > Databases page in the Admin Console.
For information about configuring a database crawl, refer to Providing Database Data Source Information.
About SMB URLs
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. For more information, see Deprecation Notices. |
Use the following format for an SMB URL:
smb://<host>/<share>/<path>
The following example shows a valid SMB URL for crawl:
smb://fileserver.mycompany.com/myshare/mydir/mydoc.txt
smb — Indicates the network protocol that is used to access the object.
<host> — Specifies the DNS host name.
<share> — Specifies the name of the shared folder. For example, the Microsoft Windows style share C:\myshare\ corresponds to the SMB URL smb://myhost.mycompany.com/myshare/.
<path> — Specifies the path to the document, relative to the root share. For example, if myshare on myhost.mycompany.com shares all the documents under the C:\myshare directory, the file C:\myshare\mydir\mydoc.txt is retrieved by smb://myhost.mycompany.com/myshare/mydir/mydoc.txt.
In addition, ensure that the file server accepts inbound TCP connections on ports 139 and 445. Port 139 is used to send NetBIOS requests for SMB crawling and port 445 is used to send Microsoft CIFS requests for SMB crawling. These ports on the file server must be accessible to the search appliance. For information about checking the accessibility of these ports on the file server, see Authentication Required (401) or Document Not Found (404) for SMB File Share Crawls.
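As a quick sanity check, assuming a Unix-like administrative workstation with the nc utility and using a placeholder file server name (this only approximates the appliance's own network path), you can test whether the ports are reachable:
nc -zv fileserver.mycompany.com 139
nc -zv fileserver.mycompany.com 445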
Unsupported SMB URLs
Some SMB file share implementations allow URL schemes beyond the format described above; the file system crawler does not support such URL schemes.
SMB URLs for Non-file Objects
For example, an SMB URL that refers to a share rather than to an individual file resolves to the files and subdirectories contained within the share’s top-level directory.
Hostname Resolution
Setting Up the Crawler’s Access to Secure Content
The information in this document describes crawling public content. For information about setting up the crawler’s access to secure content, see the Overview in Managing Search for Controlled-Access Content.
Configuring Searchable Dates
For dates to be properly indexed and searchable by date range, they must be in ISO 8601 format:
YYYY-MM-DD
The following example shows a date in ISO 8601 format:
2007-07-11
The following meta tag contains a date that can be indexed:
<meta name="date" content="2007-07-11">
The date in the following meta tag cannot be indexed because there is additional content:
<meta name="date" content="2007-07-11 is a date">
Defining Document Date Rules
Documents can have dates explicitly stated in more than one place, for example, in a meta tag.
To define a rule that the search appliance crawler should use to locate document dates (see How Are Document Dates Handled?) in documents for a particular URL, use the Index > Document Dates page in the Admin Console. If you define more than one document date rule for a URL, the search appliance finds all the matching dates in a document and uses the date located by the first matching rule (from top to bottom) as its document date.
To define a document date rule:
1. Choose Index > Document Dates.
2. In the Host or URL Pattern box, enter the host or URL pattern for which you want to set the rule.
3. Use the Locate Date In drop-down list to select the location of the date in documents that match the specified URL pattern.
4. If you select Meta Tag, specify the name of the tag in the Meta Tag Name box. Make sure the meta tag actually appears in your HTML. For example, for the tag <meta name="publication_date">, enter “publication_date” in the Meta Tag Name box.
5. To add another date rule, click Add More Lines, and add the rule.
6. Click Save. This triggers the Document Dates process to run.
For complete information about the Document Dates page, click Admin Console Help > Index > Document Dates in the Admin Console.