Google Search Appliance Documentation

Administering Crawl

Database Crawling and Serving

Crawling is the process where the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators how to configure database crawling and serving.


Database Crawler Deprecation Notice

In GSA release 7.4, the on-board database crawler is deprecated. It will be removed in a future release. If you have configured on-board database crawling for your GSA, install and configure the Google Connector for Databases 4.0.4 or later instead. For more information, see “Deploying the Connector for Databases,” available from the Connector Documentation page.


Introduction

This chapter describes how the Google Search Appliance crawls databases and serves results from them. It is intended for search appliance administrators who need to understand how to configure database crawling and serving and how to troubleshoot problems.

The following list describes the major sections in this chapter.

Supported Databases: The relational databases that the Google Search Appliance can crawl

Overview of Database Crawling and Serving: How the Google Search Appliance crawls a database, uploads database content for inclusion in the index, and serves and displays search results from a database

Configuring Database Crawling and Serving: How to configure a Google Search Appliance for database crawling and serving

Starting Database Synchronization: How to start the process of crawling a database and uploading content for inclusion in the search index

Troubleshooting: How to troubleshoot a database crawl

Frequently Asked Questions: Questions about database crawling and serving


Supported Databases

The Google Search Appliance provides access to data stored in relational databases by crawling the content directly from the database and serving the content. To access content in a database, the Google Search Appliance sends SQL (Structured Query Language) queries using JDBC (Java Database Connectivity) adapters provided by each database company.

The following list shows the databases and JDBC adapter versions that the Google Search Appliance supports.

DB2®: IBM® DB2 Universal Database (UDB) 8.1.0.64

MySQL®: MySQL Connector/J 3.1.13

Oracle®: Oracle Database 10g Release 2, 10.1.0.2.0 driver

SQL Server™: Microsoft® SQL Server™ 2008 JDBC™ Driver 2.0

Sybase®: jConnect™ for JDBC™ 5.5 Build 25137


Overview of Database Crawling and Serving

The Google Search Appliance has two built-in components for crawling and serving content from databases:

TableCrawler connects to the database when you start database synchronization and sends the crawl query (see Crawl Queries) to retrieve the content to be indexed.

TableServer connects to the database when a serve query (see Serve Queries) is defined and the user clicks on a search result from the database.

The following diagram provides an overview of the major database crawl and serve processes:

See the following sections in this document for descriptions of each major process. For an explanation of the symbols used in the diagram, refer to About the Diagrams in this Section.

Synchronizing a Database

The process of crawling a database is called synchronizing a database. Full database synchronizations are always manual and you must start one by using the Content Sources > Databases page. The Google Search Appliance does not currently support scheduled database synchronization.

After you start database synchronization (see Starting Database Synchronization), the TableCrawler and TableServer use JDBC adapters to connect to the specified database. They connect by using information that you provide when you configure database crawling and serving (see Configuring Database Crawling and Serving).

When you start database synchronization, the TableCrawler component of the Google Search Appliance sends the database the SQL query that you entered in the Crawl Query field, using the JDBC database client libraries.

The results are wrapped in Feed XML (eXtensible Markup Language) syntax (see The Feed XML and Stylesheet), and include a record for each row of the database crawl query results. This database feed file is presented to the Feeder system as soon as the crawl is complete.

To prevent the Google Search Appliance from deleting database content from the index when its PageRank™ is low and the index has reached its license limit, click the Lock documents checkbox under Create (see Providing Database Data Source Information).

Processing a Database Feed

Once synchronization is complete, the database feed is automatically uploaded to the search appliance for inclusion in the index. The crawl query (see Crawl Queries) is used to produce a feed description. All feeds, including database feeds, share the same namespace. Database source names should not match existing feed names.

Unique Generated URLs

The database synchronization process generates the URL attribute. Note that the URL uses the IP address of the Google Search Appliance rather than its hostname.

Pages created from the database being indexed all have the form shown in the following example.

googledb://<database-hostname>/<DB_SOURCE_NAME>/

Therefore, you need to enter this pattern in Follow Patterns on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. For more details, see Setting URL Patterns to Enable Database Crawl.

The Feeds connector generates a URL from either the Primary Key columns or from the URL column specified in the Serve URL Field on the Content Sources > Databases page. A unique hash value is generated from the primary key to form part of the URL.

These generated URLs have the form shown in the following example.

http://<appliance_hostname>/db/<database-hostname>/<DB_SOURCE_NAME>/<result_of_hash>

Serving Database Content

Once the search appliance index has been updated with the Feed XML file, the content is available for serving in approximately 30 minutes. The following sections describe how the Google Search Appliance generates and displays search results.

Generating Search Results

The linked content from search results is generated by the serve query at serve time. If a user’s query returns results that were originally retrieved from a database query, each result links to content that is queried from the database at serve time. The associated snippets are generated from the index.

If the database has changed since the last database synchronization, the resulting page may not relate to the original user search.

If the database serve query has changed since the database was last synchronized, the search results may produce pages that do not match the user’s query.

Displaying Search Results

The TableServer displays search results for a query from the index. The search is made over the indexed content and shows URLs and snippet information from the crawl query results. The result links direct to URLs that are either:

Generated on the search appliance (when Serve Query is used). The URLs are rewritten by the XSLT to be served through the search appliance by using the following format: http://<appliance_hostname>/db/<database-hostname>/<DB_SOURCE_NAME>/<result_of_hash>

or

Obtained from the database (when Serve URL Field is used).

The TableServer is not used if Serve URL Field is selected. Serve URL Field indicates the column in the database that contains a URL to display for each row, when the user clicks on the result of a search.

When Serve URL Field is selected, the database stylesheet is only used to format the database data for indexing and for the snippets. It is not used by the referenced URL from Serve URL Field.
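For example (a sketch assuming a hypothetical products table whose product_url column points to pages on your own website), the following crawl query includes a URL column; setting Serve URL Field to product_url makes each result link directly to that page:

SELECT product_id, product_name, description, product_url
FROM products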


Configuring Database Crawling and Serving

Before you can start database synchronization (see Starting Database Synchronization), you must configure database crawling and serving by performing two tasks: providing database data source information and setting URL patterns to enable database crawl.

Providing Database Data Source Information

This information enables the crawler to access content stored in the database and to format search results. Database data source information includes the following items:

Crawl query (see Crawl Queries)—A SQL query for the database that returns all rows to be indexed
Serve query (see Serve Queries)—A SQL statement that returns a row from a table or joined tables that matches a search query
Data Display/Usage (see The Feed XML and Stylesheet)—The stylesheet used to format the content of the Feed XML document
Advanced Settings (see Configuring Advanced Settings)—Incremental crawl query, BLOB fields

You provide database data source information by using the Create New Database Source section on the Content Sources > Databases page in the Admin Console.

For complete information about the Create New Database Source section, click Admin Console Help > Content Sources > Databases in the Admin Console.

Crawl Queries

An SQL crawl query must be in the form shown in the following example.

SELECT <table.column> [, <table.column>, ...]
FROM <table> [, <table>, ...]
[WHERE some_condition_or_join]

Each row result corresponds to a separate document. The information retrieved from the crawl query provides the data for the indexing.
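For instance, a crawl query can select from joined tables. The following sketch assumes hypothetical employee and department tables:

SELECT employee.employee_id, employee.last_name, department.dept_name
FROM employee, department
WHERE employee.dept = department.dept_id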

Serve Queries

An SQL serve query is used when a user clicks a search result link to retrieve and display the desired document data from the database.

A serve query selects the particular row to display by using the ’?’ bind variable in the WHERE clause. The Primary Key Fields must provide the column names whose values are substituted for the ’?’.

Primary Key Fields are column heading names (separated by commas) such as Last_Name, First_Name, SSN (Social Security Number), Birth_Date. The Primary Key field must provide a unique identifier for a database query result. This may be a combination of column names whose corresponding values produce a unique combination. The Primary Key allows each result row from a database query to be reliably identified by the serve query. Primary keys must be listed in exactly the same order as they appear in the WHERE clause.

Crawl and Serve Query Examples

This section shows example crawl and serve queries for an employee database with these fields:

employee_id, first_name, last_name, email, dept

The following example shows the crawl query.

SELECT employee_id, first_name, last_name, email, dept
FROM employee

The following example shows the serve query.

SELECT employee_id, first_name, last_name, email, dept
FROM employee
WHERE employee_id = ?

The Primary Key field for this case must be employee_id. The ’?’ signifies that this value is provided at serve time, from the search result that the user clicks.

For a table with multiple column Primary Keys, if the combination of employee_id, dept is unique, you can use multiple bind variables. The crawl query for this example is the same as shown in this section. The following example shows the serve query.

SELECT employee_id, first_name, last_name, email, dept
FROM employee
WHERE employee_id = ? AND dept = ?

Note: The column names specified in the WHERE clause must be included in the same order in the Primary Key Fields.
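For example, for the two-column serve query shown above, the Primary Key Fields entry must be:

employee_id, dept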
The Feed XML and Stylesheet

You specify a stylesheet for formatting the content of the Feed XML document in the Data Display/Usage section of the Content Sources > Databases page in the Admin Console. This stylesheet defines the formatting applied to each record.

You can use the default database stylesheet, or upload your own. To view the default database stylesheet, dbdefault.xsl, download it from this link: https://support.google.com/gsa/answer/6069358. You can make changes to it, and then upload it by using the Upload stylesheet selection on the Content Sources > Databases page.

Each row returned by the database is represented by a unique URL in the appliance index.

The following example shows internally stored output of a database sync, using the default database stylesheet. In this example, the database hostname is mydb.mycompany.com and the database source name is DB_SOURCE_NAME.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed SYSTEM "http://ent1:7800/gsafeed.dtd">
<gsafeed>
<header>
<datasource>DB_SOURCE_NAME</datasource>
<feedtype>full</feedtype>
</header>
<group>
<record url="googledb://mydb.mycompany.com/DB_SOURCE_NAME/azE9MSwwOTk4T0U3NTAwNisrKysrKysrKyZrMj0yLDA" 
action="add" mimetype="text/html" lock="true"> 
<content><![CDATA[<html>
<head>
<META http-equiv="Content-Type" content="text/html; 
charset=UTF-8"> 
<title>Default Page</title> 
</head> 
<body style="font-family: arial, helvetica, sans-serif;"> 
<H2 align="center">Database Result</H2> 
<table cellpadding="3" cellspacing="0" align="center" 
width="100%" border="0"> 
<tr> 
<th style="border-bottom: #ffcc00 1px solid;" align="left"> 
DOC_ID_COLUMN 
</th> 
<th style="border-bottom: #ffcc00 1px solid;" align="left"> 
SECOND_COLUMN_NAME 
</th> 
<th style="border-bottom: #ffcc00 1px solid;" align="left"> 
THIRD_COLUMN_NAME 
</th> 
</tr> 
<tr valign="top" bgcolor="white"> 
<td><font size="-1">2</font></td> 
<td><font size="-1">Second column content data.</font></td> 
<td><font size="-1">Third column content data.</font></td> 
</tr> 
</table> 
</body> 
</html> 
]]></content> 
</record> 
</group> 
</gsafeed> 

The following example shows how to add metadata to indexed content by customizing the <head> section in the default database stylesheet. In this example, name and content metadata is added:

<head>
<title>Default Page</title>
<xsl:for-each select="/database/table/table_rec/*">
<meta>
<xsl:attribute name="name">
<xsl:value-of select="name(.)"/>
</xsl:attribute>
<xsl:attribute name="content">
<xsl:value-of select="."/>
</xsl:attribute>
</meta>
</xsl:for-each>
</head>
Configuring Advanced Settings

The Advanced Settings section of the Database Datasource Information contains options for configuring an incremental crawl query and BLOB type and content. The following sections describe the advanced settings.

Incremental Crawl Query

Provides a means for the appliance to update the index of database data, without having to retrieve the entire contents of an unconstrained query.

The Incremental Crawl Query requires a modified version of the Crawl Query. It must include a last_modified_time condition of the following form:

SELECT ... WHERE last_modified_time > ?

The ? will hold the last successful crawl time from the appliance. If you do not use the ? character, the query will fail.

One of the joined tables must have a modification time column. The time format used for modification times is YYYY-MM-DD HH:MM:SS, in GMT. The column must have a date data type.
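For example, assuming the employee table from the earlier examples also has a last_modified_time column (hypothetical here), the incremental crawl query could be:

SELECT employee_id, first_name, last_name, email, dept
FROM employee
WHERE last_modified_time > ?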

Incremental feeds and full feeds allow for deletion and addition of data. These take the following form:

SELECT ..., action WHERE last_modification_time > ?

Action Field

The Action column must specify either “add” or “delete”.

The database administrator should populate the ‘action’ column by using database triggers. The ‘action’ column need not be part of the source table; it can instead be part of a separate logging table that is joined with the source table holding the content by means of primary keys. The database administrator should purge the logging information of all entries dated before the last successful incremental crawl.
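The following sketch shows what such a joined incremental query might look like, assuming a hypothetical employee_log table that triggers populate with employee_id, action, and last_modification_time columns:

SELECT e.employee_id, e.first_name, e.last_name, e.email, e.dept, el.action
FROM employee e, employee_log el
WHERE e.employee_id = el.employee_id
AND el.last_modification_time > ?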

BLOB MIME Type Field

The name of the column that contains the standard Internet MIME type values of Binary Large Objects, such as text/plain and text/html.

Database feeds support content in BLOB columns. The MIME type information must be supplied as a column. BLOBs use Base64 binary encoding. The XSL transformation from the specified stylesheet is not applied to BLOB data or its associated row.

BLOBs are automatically encoded as Base64 when they are crawled by the TableCrawler. BLOB results display HTML snippets, but their links point to the original binary format (for example, MS Word or PDF). The cache link for the snippet provides an HTML representation of the binary data.

Multiple BLOBs in a single query are not supported. A CLOB can be treated as a BLOB column or as text.

The search appliance usually transforms data from crawled pages, which protects against security vulnerabilities. If you cause the search appliance to crawl BLOB content by filling in these advanced settings, certain conditions could open a vulnerability.

BLOB Content Field

The name of the column that contains the BLOB content, of the type described in the BLOB MIME Type Field.
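As a sketch, assuming a hypothetical documents table whose mime_type column holds the MIME type and whose doc_content column holds the binary data, the crawl query might be the following, with BLOB MIME Type Field set to mime_type and BLOB Content Field set to doc_content:

SELECT doc_id, title, mime_type, doc_content
FROM documents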

Setting URL Patterns to Enable Database Crawl

When you set up a database crawl you need to include entries in the Follow Patterns fields on the Content Sources > Web Crawl > Start and Block URLs page of the Admin Console.

To include all database feeds, use the following crawl pattern:

^googledb://

To include a specific database feed, use the following crawl pattern:

^googledb://<database_host_name>/<database_source_name>/

URLs and URL patterns are case sensitive. If you use uppercase for the database source name, you must use the same uppercase in the crawl start URLs and crawl patterns.

If your data source contains a URL column with URLs that point to your own website, add those URL patterns under Follow Patterns on the Content Sources > Web Crawl > Start and Block URLs page.

For complete information about the Content Sources > Web Crawl > Start and Block URLs page, click Admin Console Help > Content Sources > Web Crawl > Start and Block URLs in the Admin Console.

For more information about URL patterns, see Constructing URL Patterns.


Starting Database Synchronization

After you configure database crawling and serving (see Configuring Database Crawling and Serving), you can start synchronizing a database by using the Content Sources > Databases page in the Admin Console.

To synchronize a database:

1. Click Content Sources > Databases.
2. In the Current Databases section of the page, click the Sync link next to the database that you want to synchronize.

After you click Sync, the link label changes to Sync’ing, which indicates that the database crawl is in progress. When the crawl completes, Sync’ing no longer appears. However, to see the updated status for a database synchronization, you must refresh the Content Sources > Databases page. To refresh the page, navigate to another page in the Admin Console and then navigate back to the Content Sources > Databases page.

You can synchronize several databases at the same time. The resulting feeds are also processed concurrently. For any given database, the search appliance must finish synchronization before it can begin to process the database feed.

You can also use the Current Databases section of the Content Sources > Databases page to edit database data source information for a database, delete a database, or view log information for a database.

For complete information about the Current Databases section, click Admin Console Help > Content Sources > Databases in the Admin Console.

Monitoring a Feed

You can monitor the progress of a database feed by using the Content Sources > Feeds page in the Admin Console. This page shows the status of all feeds, including the automatically entered database feed. When a feed process successfully finishes, it is marked completed on the page.

For complete information about the Content Sources > Feeds page, click Admin Console Help > Content Sources > Feeds in the Admin Console.

For more information about feeds, refer to the Feeds Protocol Developer’s Guide.


Troubleshooting

This section contains recommended actions for troubleshooting problems when database synchronization does not appear to result in a feed.

Verify the hostname and port

Verify that a database process is listening on the host and port that has been specified on the Content Sources > Databases page. For example, the default port for MySQL is 3306. Run the following command and verify that you get a connection. Be sure to test this from a computer on the same subnet as the appliance.

telnet <dbhostname> 3306

Verify the remote database connection

Check that you are able to connect to the database from a remote computer using the connection parameters. For example, run the following command line to connect to a remote MySQL database:

mysql -u<username> -p<password> -h<dbhostname> <database>

In this example, the MySQL client must be available on the computer used.

Check the database logs

Look at the logs on the database server to see whether there was a successful connection.

Check for a database URL in the Follow and Crawl Patterns

Ensure that a suitable follow and crawl URL pattern for the database is entered on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. If there is no suitable pattern:

1. Under Follow Patterns, add a pattern such as: ^googledb://<appliance_ip_address>/db/
2. Click the Save button to save your changes.

Check the Serve Query or Serve URL

On the Content Sources > Databases page, make sure there is a valid entry in either Serve Query or Serve URL Field.

Check the SQL query

Make sure the SQL queries are valid given primary key substitutions for the ’?’ value. Test the query by using the SQL client for the database being used.
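For example, using the employee serve query from earlier, substitute a literal value for the ’?’ and run the query in your SQL client (1003 here is a hypothetical employee_id value):

SELECT employee_id, first_name, last_name, email, dept
FROM employee
WHERE employee_id = 1003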

Verify a known-working crawl query

Configure a crawl query that is known to work and that produces a small number of results, and verify that it completes correctly. This helps isolate problems in your full crawl query from problems with the connection or configuration.

Check database networking

Run a tcpdump on the traffic between the appliance and the database server on the specified port when you do a sync. Compare to a tcpdump from a successful connection.
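For example, on a host positioned to capture this traffic, a command such as the following (assuming MySQL’s default port 3306) records the session to a file for comparison:

tcpdump -w dbsync.pcap host <dbhostname> and port 3306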

Check the feed

If the sync has completed successfully, troubleshoot the feed. For information about troubleshooting feeds, refer to the Feeds Protocol Developer’s Guide.


Frequently Asked Questions

Q: Can multiple databases be synchronized at one time?

A: Yes.

Q: What would happen if a new database synchronization were started and completed while a previous one is still being processed by the feeder? Is the feeder restarted with the new feed or the new feed just queued up behind the one running?

A: Database feeds are processed in queued order. If a feed is being processed, a new feed of the same description is queued up behind the one running.

Q: When will the database be resynchronized?

A: A database is only synchronized fully when you click the Sync link on the Content Sources > Databases page in the Admin Console. Incremental synchronizations are a more efficient mechanism for updating the index for large query result sets which only change partially. Neither incremental nor full database syncs are scheduled automatically.

Q: Is there a way to schedule a database crawl?

A: There is currently no way to schedule a database crawl. It can be synchronized manually from the Content Sources > Databases page.

Q: Can a sync be stopped once it’s in progress?

A: There is currently no way to stop a database synchronization once it is started.

Q: How can I tell the status of the sync?

A: On the Content Sources > Databases page, the Sync link under Current Databases reads Sync’ing after the link is clicked. Go to a different page in the Admin Console and return to the Content Sources > Databases page to refresh and update the status. After a successful database synchronization, a feed appears on the Content Sources > Feeds page.

Q: Sun defines several types of JDBC adapters. Which ones can be used?

A: The Google Search Appliance supports the Java to DB direct adapter types.

Q: What SQL is supported in the database configuration page? SQL99? ANSI SQL?

A: This is dependent on the JDBC adapters used to connect with your database.

Q: Can the list of JDBC adapters used be updated?

A: Currently the Google Search Appliance does not support modification of the JDBC adapter list.

Q: Is the database connection secure?

A: The Google Search Appliance does not provide an encrypted connection channel to the database unless the JDBC adapter supports one. It is recommended that you host this connection over a private network link and assume that JDBC communication is insecure.

Q: Can the Google Search Appliance crawl the access information from the database?

A: The Google Search Appliance does not support evaluation of database ACLs.

Q: Can the Fully Qualified Domain Name be used for the crawl patterns?

A: No. Currently, the crawl patterns must specify the appliance using its IP address.

Q: Can the result of a database synchronization be translated by using a customized XSLT to reduce the number of documents counted?

A: The Google Search Appliance does not support translation of the database feed itself. Each row from the crawl query counts as a crawled document. The number of rows produced can only be reduced by optimizing the crawl query.

Q: Is there a DTD for the XML produced from the database for providing a custom stylesheet?

A: There is no DTD for the database XML, as the structure is dependent on the SELECT clause of the crawl query. An identity transformation stylesheet is available, which allows you to see the raw database XML structure, as it appears before it is transformed by a stylesheet.

To use the identity_transform.xsl and see the raw XML:

1. Provide information about the database using the Create New Database Source section of the Content Sources > Databases page.
2. In the Data Display/Usage section, upload identity_transform.xsl by using the Upload stylesheet option.
3. Make sure the Serve Query and Primary Key Fields are completed.
4. Click Create.
5. Click Sync for the database.

Q: How does the search appliance invalidate the database crawled contents of the index for a particular collection?

A: When you click the Sync link on the Content Sources > Databases page in the Admin Console for a database source name, the old contents in the index are invalidated.