Wednesday, July 22, 2009

Enterprise Search Architecture & Configuration - Part 1



Enterprise Search Architecture
The following figure shows the internal architecture of Enterprise Search.



Below is a brief description of the components involved in this architecture:
  • Protocol Handlers
    • Open content sources in their native protocols and expose documents and other items to be filtered.
    • Connect to and traverse content sources over a given protocol.
    • Identify content, invoke IFilters, retrieve system-level metadata, and return content and metadata streams to the index engine.
    • Protocol handlers available out of the box (OOTB):
      • Web protocol handler
      • SharePoint protocol handler
      • File protocol handler
      • Exchange public folder protocol handler
      • Lotus Notes protocol handler


  • IFilters
    • Open documents and other content source items in their native formats.
    • Filter content into chunks of text and properties.
    • Filter out embedded formatting and retrieve content and properties.
    • IFilters included in MOSS 2007 -


  • Word breakers - Used by the query and index engines to break compound words and phrases into individual words or tokens.
    • Word breakers in the indexing process - Identify breaking characters, such as white space and punctuation, and then identify the words to be indexed.

    • Language-specific word breakers and compound words
    • Word breakers at query time

  • Stemmers
    • Inflectional forms: nouns and verbs
    • Stemmers in the indexing process - Morphological analysis
    • Stemmers at query time - Morphological generation
    • Language-specific stemmers


  • Content Index - Stores information about words and their location in a content item.

  • Property Store - Stores a table of properties and associated values.

  • Search Configuration Data - Stores information used by the Search service, including crawl configuration, property schema, scopes, and so on.
    • Content Sources - A content source is a collection of start addresses representing content that should be crawled by the search index component. A content source also specifies settings that define the crawl behavior and the schedule on which the content will be crawled. The following content source types are included in Enterprise Search:
      • SharePoint content
      • Web content
      • File share content
      • Exchange folder content
      • Business data content
        If you need to include other types of content, you can create a custom content source and protocol handler for Enterprise Search.
    • Crawl Log - The crawl log tracks information about the status of crawled content, and contains the current status of every item in the content index.
    • Search Scopes - A search scope is a collection of items grouped together based on a common element, which helps users broaden or narrow their searches. Search scopes available at the SSP level are called shared scopes. Search scopes are also available at the site level. Search scopes created at the site level are only visible to the site they were created in, and to subsites within the top-level site.
    • Keywords and Best Bets - Keywords are words or phrases that site administrators have identified as important. They provide a way to display additional information and recommended links on the initial results page that may not otherwise appear in the search results for a particular word or phrase. Best bets are a list of resources recommended by the search administrator for a keyword. There is a many-to-many relationship between keywords and best bets. A keyword will likely have more than one best bet associated with it, and a best bet can be associated with multiple keywords. This means the best bets list is configured separately from the keywords list. After a best bet is added to this list, it can be associated with the appropriate entries in the Keywords list.
    • Search schema - The Enterprise Search schema consists of two types of properties, crawled properties and managed properties, as well as the mappings between the two sets of properties. The index engine extracts crawled properties from content items when crawling content. These properties are grouped into different property categories based on the protocol handler and IFilter used. Managed properties are the set of properties that are part of the search user experience, so a crawled property must be mapped to a managed property in the Document property mappings before its value can be used in search functionality. Managed properties are created and managed at the SSP level.

  • Index Engine - Processes the chunks of text and properties filtered from content sources, storing them in the content index and property store.
    • Master and Shadow Indexes
    • Continuous propagation from the index server to the query server - An indexed document becomes searchable within 3 to 30 seconds
  • Query Engine - Executes keyword and SQL syntax queries against the content index and search configuration data.

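To make the roles of the word breaker, stemmer, and content index more concrete, here is a minimal Python sketch of how they could fit together. It is a toy model and not how MOSS implements them: the suffix list, the `ContentIndex` class, and all function names are assumptions made up for illustration, and real word breakers and stemmers are language-specific.

```python
import re
from collections import defaultdict

# Hypothetical toy suffix list; real stemmers do full morphological analysis.
SUFFIXES = ("ing", "ed", "es", "s")


def break_words(text):
    """Split text on white space and punctuation and return lowercase tokens."""
    return [token.lower() for token in re.split(r"[^\w]+", text) if token]


def stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class ContentIndex:
    """Stores each stemmed word with the items and positions where it occurs."""

    def __init__(self):
        self.postings = defaultdict(list)  # stem -> [(item_id, position), ...]

    def add(self, item_id, text):
        for position, word in enumerate(break_words(text)):
            self.postings[stem(word)].append((item_id, position))

    def search(self, query):
        """Return the items that contain every (stemmed) word of the query."""
        matches = None
        for word in break_words(query):
            items = {item_id for item_id, _ in self.postings.get(stem(word), [])}
            matches = items if matches is None else matches & items
        return matches or set()


index = ContentIndex()
index.add("doc1", "Crawling file shares, indexing documents.")
index.add("doc2", "The crawler indexed one document")
hits = index.search("indexes documents")
```

The same `break_words` and `stem` functions run at index time and at query time, which is why a query for "indexes" can match items that contain "indexing" or "indexed".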

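The crawled-to-managed property mapping in the search schema can be pictured the same way. The sketch below assumes that each managed property lists its crawled properties in priority order and that the first one with a value wins. The property names and the `to_managed_properties` helper are hypothetical and are not taken from the product.

```python
# Hypothetical mappings: managed property -> crawled properties, in priority order.
PROPERTY_MAPPINGS = {
    "Author": ["office:author", "sharepoint:created_by"],
    "Title": ["office:title", "web:title"],
}


def to_managed_properties(crawled, mappings=PROPERTY_MAPPINGS):
    """Return the managed property values for one item's crawled properties."""
    managed = {}
    for managed_name, crawled_names in mappings.items():
        for crawled_name in crawled_names:
            value = crawled.get(crawled_name)
            if value:
                managed[managed_name] = value
                break  # assumption: first mapped crawled property with a value wins
    return managed


item = {
    "sharepoint:created_by": "jdoe",
    "web:title": "Quarterly report",
    "file:size": "18432",  # not mapped, so it never reaches the managed set
}
managed = to_managed_properties(item)
```

In this sketch an unmapped crawled property such as `file:size` is simply dropped, which mirrors the point that a crawled property has to be mapped before search can use it.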

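The many-to-many relationship between keywords and best bets, with the best bets list kept separate from the keywords list, could be modeled roughly as follows. The `BestBetStore` class, its method names, and the example.com URLs are all assumptions for illustration only.

```python
from collections import defaultdict


class BestBetStore:
    """Keeps best bets in their own list and links keywords to them."""

    def __init__(self):
        self.best_bets = {}  # url -> title (the separate best bets list)
        self.keyword_to_urls = defaultdict(list)  # keyword -> ordered best bet URLs

    def add_best_bet(self, url, title):
        self.best_bets[url] = title

    def associate(self, keyword, url):
        """Link an existing best bet to a keyword; a best bet may serve many keywords."""
        if url not in self.best_bets:
            raise KeyError(f"unknown best bet: {url}")
        urls = self.keyword_to_urls[keyword.lower()]
        if url not in urls:
            urls.append(url)

    def lookup(self, query):
        """Return (url, title) pairs for a query that matches a keyword exactly."""
        urls = self.keyword_to_urls.get(query.strip().lower(), [])
        return [(url, self.best_bets[url]) for url in urls]


store = BestBetStore()
store.add_best_bet("http://example.com/hr/leave", "Leave policy")
store.add_best_bet("http://example.com/hr/calendar", "Holiday calendar")
store.associate("vacation", "http://example.com/hr/leave")
store.associate("vacation", "http://example.com/hr/calendar")
store.associate("holiday", "http://example.com/hr/calendar")
vacation_bets = store.lookup("Vacation")
```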
Source - http://msdn.microsoft.com/en-us/library/ms570748.aspx


Part 2 - Search processes and server roles involved in Search Architecture >>
