Unleashing SharePoint's Potential for the Real World

Crawling an Individual Document Library

Have you ever wanted to crawl a specific document library in a SharePoint site?  Have you tried, only to end up crawling the entire portal or site collection instead?  I've been told that this can be accomplished by crawling the library by its UNC path; however, that still doesn't work.

A customer I work with had a technical requirement to have multiple portal implementations across the US.  Each portal implementation had a specific document library that contained documents and metadata.  The main portal in this scenario was responsible for crawling the individual document libraries in each portal implementation; however, we always ran into issues with the crawler "jumping outside" of the intended crawl scope.  Sure, we could add include/exclude paths until our eyes bled, but that process never really seems to work as one would expect.  We opened a ticket with Microsoft and were presented with an approach that actually works.

1.  Identify the underlying document library's "site" or "area" and use the crawl logs to find the URL that SharePoint used to crawl the content.  This is a painful process; however, it can be made easier by searching the gatherer logs that are stored in the portal's underlying _Serv database.  The URL you should look for will have the form: sts2://<servername>/webid=000/listid={listid}.  To this day I have no clue where the web id comes from, since web ids are typically GUIDs.  The list id, on the other hand, is a GUID that is easy to ascertain by looking at querystrings on the portal site.  Regardless, it is much easier to find this URL by querying the gatherer log tables in the _Serv database.

2.  Once the exact URL is identified, you can add an Exchange Public Folder content source that points to the aforementioned URL.  Configure the content source to crawl as desired and start the crawl.  Assuming the crawl account you are using has access to the SharePoint site, you're in business.
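
If you export the URLs from the gatherer log to a text file or a query result, picking out the one for a given list can be scripted.  The following is only a rough Python sketch of that idea.  It assumes the URLs follow the sts2:// form quoted in step 1; the server name, the GUIDs, and the function name find_list_urls are made up for illustration, and how you pull the URLs out of the _Serv database is left to you.

```python
import re

# Assumed URL shape, taken from the form quoted in step 1:
#   sts2://<servername>/webid=000/listid={listid}
STS2_PATTERN = re.compile(
    r"^sts2://(?P<server>[^/]+)"
    r"/webid=(?P<webid>[^/]+)"
    r"/listid=\{?(?P<listid>[0-9a-fA-F-]{36})\}?/?$",
    re.IGNORECASE,
)


def find_list_urls(urls, list_id):
    """Return the URLs whose listid matches list_id (braces and case ignored)."""
    wanted = list_id.strip("{}").lower()
    matches = []
    for url in urls:
        candidate = url.strip()
        m = STS2_PATTERN.match(candidate)
        if m and m.group("listid").lower() == wanted:
            matches.append(candidate)
    return matches


# Hypothetical URLs standing in for rows exported from the gatherer log.
sample_urls = [
    "sts2://portala/webid=000/listid={aaaaaaaa-0000-0000-0000-000000000001}",
    "sts2://portala/webid=000/listid={11111111-2222-3333-4444-555555555555}",
    "http://portala/sites/hr/default.aspx",
]

# The list GUID as it would appear in a querystring on the portal site.
for url in find_list_urls(sample_urls, "{11111111-2222-3333-4444-555555555555}"):
    print(url)
```

The URL it prints is the one you would paste into the content source in step 2.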

By following these two steps, you can crawl individual document libraries and/or lists in SharePoint.  This is very powerful for content aggregation across an enterprise that has disparate stores for documents.  Assuming the documents have like metadata, an advanced search scenario makes this even more interesting.  Since each underlying URL that was identified in steps 1 and 2 above is a content source, you can create a scope that includes each content source.  With some custom programming, a SharePoint developer can create an interface that lets business users choose which content source or sources they want to search and supply search inputs to find documents by metadata in an advanced search.

An example of this scenario is as follows:  Joe User wants to find all documents in Portal A's library and Portal C's library, but not Portal B's library, with department = HR and document type = specification, plus a free text search for documents containing the word SharePoint.
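
To make the filtering logic in Joe's query concrete, here is a small Python sketch.  It is not SharePoint code and does not use any SharePoint search API; the Document class, the search function, the content source names, and the sample documents are all hypothetical.  It only shows how the three conditions (chosen content sources, metadata values, free text) combine.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for a crawled document and its metadata.
    title: str
    content_source: str
    metadata: dict = field(default_factory=dict)
    text: str = ""


def search(documents, sources, metadata_filters, free_text):
    """Keep documents from the chosen sources that match every metadata
    filter and contain the free text (all comparisons ignore case)."""
    wanted_sources = set(sources)
    needle = free_text.lower()
    results = []
    for doc in documents:
        if doc.content_source not in wanted_sources:
            continue
        if any(doc.metadata.get(key, "").lower() != value.lower()
               for key, value in metadata_filters.items()):
            continue
        if needle and needle not in doc.text.lower():
            continue
        results.append(doc)
    return results


# Made-up documents from three content sources.
docs = [
    Document("Onboarding spec", "Portal A",
             {"department": "HR", "document type": "specification"},
             "How HR uses SharePoint for onboarding."),
    Document("Payroll spec", "Portal B",
             {"department": "HR", "document type": "specification"},
             "SharePoint payroll workflow."),
    Document("Benefits spec", "Portal C",
             {"department": "HR", "document type": "specification"},
             "Benefits enrollment in SharePoint."),
    Document("Travel policy", "Portal C",
             {"department": "Finance", "document type": "policy"},
             "Travel rules stored in SharePoint."),
]

# Joe's query: Portals A and C only, department = HR,
# document type = specification, free text "SharePoint".
hits = search(
    docs,
    sources=["Portal A", "Portal C"],
    metadata_filters={"department": "HR", "document type": "specification"},
    free_text="SharePoint",
)
for doc in hits:
    print(doc.content_source, "-", doc.title)
```

In a real implementation the same three conditions would be handed to the search engine rather than applied in memory; the sketch only illustrates what the user interface needs to collect.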

IMO, this is a powerful customization.

Posted: 04-29-2006 11:17 AM by emau | with 7 comment(s)

Comments

Rajesh said:

Your article is really good...this is what I have been looking for for a long time...

I am not clear on the first step of identifying the URL of the list from the gatherer log...could you please give some more detail? We are using MOSS 2007.

Thanks a lot for this article.

# November 12, 2008 9:15 AM

emau said:

This article is more relevant to SPS 2003 than 2007.  You will actually have much better control in SharePoint 2007 to accomplish this.

# November 12, 2008 6:13 PM

Dan said:

Do you have instructions for MOSS 2007 to crawl a specific Document library and only that library?

# January 31, 2009 6:26 PM

emau said:

Dan - I'll see if I can do a test of this scenario using MOSS 2007.  I'm *assuming* it is a lot easier with MOSS.  I haven't been presented with the challenge since I had to do it for 2003.  I'll post an update with my findings.  I'm assuming your scenario is for a separate portal not in the same farm?

# February 9, 2009 10:16 PM

Willy said:

Hi,

To crawl a specific library in MOSS 2007 you just need to set the crawl rules, i.e.:

(Include) Folder = servername/.../libraryname

(Exclude) ContentType = Folder Exclude

(Exclude) ContentType = text/html; charset=utf-8 Exclude

# February 24, 2009 3:06 AM

Willy said:

Correction:

(Include) Folder = http: //servername/.../libraryname (remove space between http and servername)

# February 24, 2009 3:09 AM

sundarbalu said:

This is the access URL of my document library:

sts3://moss25:14/siteurl=/siteid={4ebe83bb-a4ee-4dce-ac60-9ec80d96389b}/weburl=/webid={bd167bca-1e1c-41b8-9635-8c6fc29b5ab7}/listid={73798bac-9d4d-40e7-8934-7ede14215154}/

When I try to enter this address,

I get an error saying the URL is not valid.

# May 16, 2009 1:53 AM