Crawling an Individual Document Library
Have you ever wanted to crawl a specific document library in a SharePoint site? Have you tried, only to end up crawling the entire portal or site collection instead? I've been told that this can be accomplished by crawling the library by its UNC path; however, that still doesn't work.
A customer I work with had a technical requirement to have multiple portal implementations across the US. Each portal implementation had a specific document library that contained documents and metadata. The main portal in this scenario was responsible for crawling the individual document libraries in each portal implementation; however, we always ran into issues with the crawler "jumping outside" of the intended crawl scope. Sure, we could add include/exclude paths until our eyes bled, but that process never really seems to work as one would expect. We opened a ticket with Microsoft and were presented with an approach that actually works.
1. Identify the underlying document library's "site" or "area" and use the crawl logs to find the URL that SharePoint used to crawl the content. Doing this by hand is painful; it is much easier to query the gatherer log tables that are stored in the portal's underlying _Serv database. The URL you should look for will have the form: sts2://<servername>/webid=000/listid={listid}. To this day I have no clue where the web id comes from, since web ids are typically GUIDs. The list id, on the other hand, is a GUID that is easy to find by looking at querystrings on the portal site.
2. Once the exact URL is identified, you can add an Exchange Public Folder content source that points to the aforementioned URL. Configure the content source to crawl as desired and start the crawl. Assuming the crawl account you are using has access to the SharePoint site, you're in business.
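To make step 1 a little less painful, here is a minimal Python sketch of the matching logic. It assumes you have already pulled the crawled URLs out of the gatherer log into a plain list of strings (how you query the _Serv database is up to you and is not shown). The function names and the querystring parameter name "List" are my own assumptions for illustration, not anything SharePoint or Microsoft supplied.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches the URL form quoted in step 1:
#   sts2://<servername>/webid=000/listid={listid}
STS2_PATTERN = re.compile(
    r"^sts2://(?P<server>[^/]+)/webid=(?P<webid>[^/]+)"
    r"/listid=(?P<listid>\{[0-9a-fA-F-]{36}\})$",
    re.IGNORECASE,
)


def list_id_from_querystring(url, param="List"):
    """Pull a list GUID out of a URL querystring, normalised to {guid} form.

    The parameter name "List" is an assumption; use whatever name
    actually appears in the querystrings on your portal.
    """
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    guid = values[0].strip("{}").lower()
    return "{" + guid + "}"


def find_library_urls(logged_urls, list_id):
    """Return the sts2:// URLs from the log that point at the given list id."""
    wanted = list_id.lower()
    hits = []
    for url in logged_urls:
        candidate = url.strip()
        match = STS2_PATTERN.match(candidate)
        if match and match.group("listid").lower() == wanted:
            hits.append(candidate)
    return hits
```

Feed list_id_from_querystring a URL copied from the library's pages, then pass the result and your exported log URLs to find_library_urls. Whatever comes back is the candidate URL to use for the content source in step 2.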
By following these two steps, you can crawl individual document libraries and/or lists in SharePoint. This is very powerful for content aggregation across an enterprise that has disparate stores for documents. Assuming the documents share similar metadata, an advanced search scenario makes this even more interesting. Since each underlying URL identified in step 1 becomes its own content source in step 2, you can create a scope that includes each content source. With some custom programming, a SharePoint developer can create an interface that lets business users choose which content source or sources they want to search and supply search inputs to find documents by metadata in an advanced search.
An example of this scenario is as follows: Joe User wants to find all documents in Portal A's library and Portal C's library, but not Portal B's library, with department = HR and document type = specification, and a free text search for documents containing the word SharePoint.
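To show what that query amounts to, here is a small in-memory Python sketch of the filtering logic. It is not the SharePoint search API. The Document class, its field names, and the search function are all hypothetical stand-ins for whatever a custom search interface would pass along to the real search service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for an indexed document and its metadata."""
    content_source: str
    department: str
    document_type: str
    text: str


def search(documents, sources, metadata, free_text):
    """Keep documents from the chosen content sources that match every
    metadata value (case-insensitive) and contain the free text."""
    needle = free_text.lower()
    results = []
    for doc in documents:
        if doc.content_source not in sources:
            continue  # content source not selected by the user
        if any(
            getattr(doc, field).lower() != value.lower()
            for field, value in metadata.items()
        ):
            continue  # at least one metadata value does not match
        if needle in doc.text.lower():
            results.append(doc)
    return results


# Joe User's query: Portal A and Portal C, but not Portal B.
docs = [
    Document("Portal A", "HR", "Specification", "SharePoint rollout spec"),
    Document("Portal B", "HR", "Specification", "SharePoint rollout spec"),
    Document("Portal C", "HR", "Specification", "Payroll system spec"),
    Document("Portal C", "IT", "Specification", "SharePoint farm spec"),
]
hits = search(
    docs,
    sources={"Portal A", "Portal C"},
    metadata={"department": "HR", "document_type": "specification"},
    free_text="SharePoint",
)
```

With this sample data only the Portal A document survives: Portal B is excluded by content source, and the two Portal C documents fail on free text and department respectively.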
IMO, this is a powerful customization.