In the briefing room: SAS Content Categorization

by basexblog

Locating content and information in the enterprise is a considerable challenge, one that not only hampers organizational productivity but also throttles individual knowledge worker efficiency and effectiveness. Workers typically use search tools to find content and this is where their struggle begins.

There are two key problems with search technology today: (1) such systems provide “results,” not answers, and (2) they do not support natural language queries. In addition, typical search tools do not always understand relationships and context: “Java” could refer to a type of coffee, an island in Indonesia, or a programming language. Typing “Java” into the Google search engine returned, for the first three pages, only results relating to Java as a programming language.

Thanks to the various flaws common to most search tools, 50% of all searches fail. The good news is that those failures are obvious and recognized by the person doing the search. The bad news is that 50% of the searches people believe succeeded actually failed in some way, but this was not readily apparent to the person doing the search. As a result, that person uses information that may be out of date, may not be the best answer to the question, or is simply incorrect. (We call this the 50/50 Rule of Search.)

The problems with search contribute greatly to the problems of Information Overload in the enterprise.

According to research conducted by Basex in 2006 and 2007, knowledge workers spend 15% of the work day searching for content. This figure is far higher than it needs to be, and represents the time knowledge workers waste as a result of poor search tools, bad search techniques on the part of knowledge workers, and a lack of effective taxonomies.

In an age of Information Overload, where we create more content in a day than the entire population of the planet could consume in a month, more effective tools are needed. One approach towards improving search is better and more effective categorization. We recently had a look at SAS Content Categorization, one promising product in this space. Content Categorization helps to categorize information so that search engines can present relevant results faster by having the user navigate through topics/facets related to the user’s query.

SAS acquired Teragram, a natural language processing and advanced linguistic technology company, in March 2008. After integrating Teragram as a division, SAS launched Content Categorization in February 2009.

The offering enables the creation of taxonomies and category rules to parse and analyze content and create metadata that can trigger business processes. Taxonomies and category rules are created via the TK240, a desktop tool for administration and taxonomy management that is a component of SAS Content Categorization. Once a taxonomy is created, high-level categories are selected, followed by narrower ones. There is no limit as to how granular the categories can go, allowing users to drill down on topics. The system also includes prebuilt taxonomies for specific industries such as news organizations, publishers, and libraries.
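To make the idea of broad categories that narrow step by step more concrete, here is a minimal sketch in Python. It is not the TK240's data model or any SAS API; the `Category` class, its methods, and the sample category names are all hypothetical and only illustrate a tree with no fixed depth.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in a hypothetical taxonomy tree (not a SAS data structure)."""
    name: str
    children: dict = field(default_factory=dict)

    def add(self, *path):
        """Create (or reuse) the chain of narrower categories named in path."""
        node = self
        for name in path:
            if name not in node.children:
                node.children[name] = Category(name)
            node = node.children[name]
        return node

    def paths(self, prefix=()):
        """Yield every category as a tuple running from broadest to narrowest."""
        here = prefix + (self.name,)
        yield here
        for child in self.children.values():
            yield from child.paths(here)


# Illustrative category names only; depth is unlimited.
root = Category("Technology")
root.add("Programming languages", "Java")
root.add("Programming languages", "Python")

for path in root.paths():
    print(" > ".join(path))
```

Each printed line is one drill-down path, which is roughly what a user would navigate when moving from a broad topic to a narrower one.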

Whoever is doing the setup (and SAS Content Categorization is designed for use by non-technical users) can develop category rules from within the TK240 as well. The rules may consist of multiple keywords, based on the percentage appearing in a document, as well as weighted keywords that give more value to certain words than others. Additionally, it is possible to apply Boolean operators. For example, a rule can require that Java and programming appear in the same sentence; a sentence containing Java and coffee would not meet that rule. Rules can be created for extremely specific situations, such as the presence of URLs, grammatical instances, or the presence of suffixes (Inc., Corp., AG., etc.).
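The rule types described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's rule syntax: the function names (`keyword_share`, `weighted_score`, `same_sentence`), the naive sentence splitter, and the sample weights are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def keyword_share(text, keywords):
    """Fraction of the document's tokens that are rule keywords."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(1 for t in toks if t in keywords) / len(toks)


def weighted_score(text, weights):
    """Average per-token weight, so some keywords count more than others."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(weights.get(t, 0.0) for t in toks) / len(toks)


def same_sentence(text, required):
    """Boolean AND scoped to a sentence: all terms must share one sentence."""
    return any(
        all(term in set(tokens(sentence)) for term in required)
        for sentence in split_sentences(text)
    )


doc_a = "Java is a popular programming language. It runs on many devices."
doc_b = "I ordered a Java coffee. Programming came later."

print(same_sentence(doc_a, {"java", "programming"}))  # rule met
print(same_sentence(doc_b, {"java", "programming"}))  # rule not met
```

In `doc_b` both words occur in the document, but never in the same sentence, so the sentence-scoped rule rejects it. A production system would need a far more careful tokenizer and sentence splitter than the ones assumed here.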

The system also supports role-based permissions that control read/write access and enable multiple users to collaborate on developing taxonomies. This gives multiple taxonomists secure access to projects, with individual levels of read/write access to category rules and concept definitions.

SAS Content Categorization can be an effective weapon against Information Overload by allowing the creation of complex automated systems to categorize content, increasing the likelihood of the knowledge workers being able to find what they are looking for in a timely manner. In addition, increasing the relevance of search results by using taxonomies to provide context raises the value of content that is found, decreasing the likelihood of knowledge workers moving forward with second-best or faulty information.

Companies looking to take decisive measures to reduce Information Overload should carefully review their current search tools and, where appropriate, give serious consideration to SAS Content Categorization.