Apple patent is for multi-language document search, retrieval system

According to Apple, a multi-lingual indexing and search system is presented that performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. The system includes a tokenizer that separates a string of text into individual word tokens, and eliminates predetermined types of tokens from further processing. The system also includes a stemmer that reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. The stemmer removes known word endings from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In an embodiment, the stemmer only removes those word endings which are associated with nouns. The system further includes an indexer that stores the stems in an index.

Here's Apple's background on the invention: "With the increasing amount of information that is available to users via today's computer systems, efficient techniques for locating information of interest are becoming essential. To expedite the process of searching and retrieving relevant information, it is a common practice to create an index of the searchable information that is available from various sources. For instance, if a collection of documents are to be searched for information, the documents are first examined to identify terms of interest, and an index is created which associates each term with the document(s) in which it appears. Thereafter, when a user constructs a search query, the terms in that query are examined against the entries in the index, to locate the documents containing the requested terms.

"Many search engines process the search results to calculate the relevance of each identified document to the query. For instance, a score can be calculated for each document, using a statistical technique that accounts for the number of query terms that are matched in the document, the frequency of each of those terms in the index, the frequency of each term in the document compared to the total number of terms in the document, and the like. Based upon these scores, the documents are displayed to the user in order of their relevance to the query. By means of such an approach, the query does not have to be a precisely constructed formula for finding only those documents which exactly match the terms of the query. Rather, it can be a list of words, or a natural language sentence.

"Before a string of text from a document or other source of information can be indexed, it must be parsed into individual words. Preferably, the separated words are further processed to expedite the search and retrieval function. The process of separating a text string into individual words is known as tokenization. As a first step, the text is parsed into word tokens. A word token may or may not be a recognized word, i.e., a word which appears in a dictionary. After the word tokens have been identified, they are processed to eliminate those which do not serve as useful search terms.

"A further process that can be carried out prior to indexing is known as 'stemming.' In essence, stemming is the reduction of words to their grammatical stems. This process serves two primary purposes. First, it helps to reduce the size of the index, since all forms of a word are reduced to a single stem, and therefore require only one entry in the index. Second, retrieval is improved, since a query which uses one form of a word will find documents containing all of the different forms.

"Ideally, the stemming processing is applied to all words that take different forms, and accounts for every possible form of each word. In this type of approach, stemming is highly language dependent. In the past, therefore, information search and retrieval systems which employed stemming were designed for a specific language. In particular, the rules that were used to reduce each word to its grammatical stem would typically apply to only one language, and could not be employed in connection with other languages. Consequently, a different search and retrieval mechanism had to be provided for each different language that might be encountered in the documents to be searched.

"With the widespread accessibility of various information sources that is provided by today's computing environments, particularly when coupled with worldwide telecommunications facilities, such as the internet, any given source of information might contain documents in multiple different languages. Furthermore, it is not uncommon for a single document to contain text in more than one language. In these type of environments, it would be impractical to have to identify the language of a document, and then employ a different search and retrieval system for each different language that might be encountered. It is an objective of the present invention, therefore, to provide a mechanism for indexing and searching textual content which is generic to a plurality of different languages."
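
To make the indexing and relevance-ranking ideas in the background concrete, here is a minimal Python sketch. It is not taken from the patent: the function names (build_index, search), the sample documents, and the TF-IDF-style scoring formula are all assumptions chosen for illustration. It builds an index that associates each term with the documents containing it, then ranks documents using the term's frequency within the document and its rarity across the index.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: occurrences of the term in that document}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()  # naive whitespace split, for illustration only
        lengths[doc_id] = len(terms)
        for term, count in Counter(terms).items():
            index[term][doc_id] = count
    return index, lengths


def search(query, index, lengths):
    """Rank documents by an assumed TF-IDF-style score, best match first."""
    n_docs = len(lengths)
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        # Terms found in fewer documents count for more.
        rarity = math.log(1 + n_docs / len(postings))
        for doc_id, count in postings.items():
            # Term frequency relative to the document's total length.
            scores[doc_id] += (count / lengths[doc_id]) * rarity
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sample data.
docs = {
    "a": "apple files patent for search index",
    "b": "search engines rank documents by relevance",
    "c": "weather report for monday",
}
index, lengths = build_index(docs)
ranked = search("patent search", index, lengths)
print(ranked)
```

Document "a" matches both query terms and document "b" matches one, so the query works as a loose list of words rather than an exact-match formula, as the background describes.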

And here's Apple's summary of the invention: "In accordance with the present invention, a multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary for a given language. During the tokenization phase of the process, a string of text is separated into individual word tokens. Predetermined types of tokens, known as junk tokens and stop words, are eliminated from further processing. As a further step, characters with diacritical marks are converted into corresponding unmarked lower case letters, to eliminate match errors that might result from incorrectly accented words.

"The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. To expedite the stemming process, as well as expand subsequent retrieval, the stemming process is not directed to finding the true grammatical root form of a word. Rather, a known word ending is removed without any effort to guarantee that the remaining stem actually appears in a dictionary. For instance, a vowel change that normally occurs within a word, as a result of the addition of an ending, is ignored during the stemming process.

"As a further feature, the stemming process is limited to word endings that are associated with nouns. This aspect of the invention is based on the assumption that nouns are much more significant than verbs, in terms of informational content in a query. Consequently, the major processing effort is directed to nouns.

"By means of these techniques, a uniform approach is provided for the tokenization and stemming of words across a variety of languages. Consequently, the search and retrieval engine can identify documents that may be relevant to the user's query, regardless of the particular language(s) appearing in a given document."
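
The tokenization and stemming steps in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Apple's implementation: the stop-word list, the noun-ending list, the minimum stem length, the rule that non-alphabetic tokens count as "junk," and all function names are hypothetical, since the excerpt does not specify them.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lists; the patent excerpt does not give the real ones.
STOP_WORDS = {"the", "a", "of", "and", "le", "la", "der", "die", "das"}
NOUN_ENDINGS = ["ies", "es", "en", "s"]  # tried in this order, longest first
MIN_STEM = 3  # assumed guard so very short words are left alone


def fold(text):
    """Lowercase the text and strip diacritical marks (e.g. 'É' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split text into word tokens, dropping junk tokens and stop words."""
    tokens = []
    for token in re.findall(r"\w+", fold(text)):
        if not token.isalpha():  # assumption: digits/underscores mark a junk token
            continue
        if token in STOP_WORDS:
            continue
        tokens.append(token)
    return tokens


def stem(token):
    """Remove one known noun ending; the result need not be a dictionary word."""
    for ending in NOUN_ENDINGS:
        if token.endswith(ending) and len(token) - len(ending) >= MIN_STEM:
            return token[: -len(ending)]
    return token


def index_terms(text):
    """Full pipeline: fold, tokenize, then stem each surviving token."""
    return [stem(t) for t in tokenize(text)]


print(index_terms("Documents and the document"))  # ['document', 'document']
```

With these assumed lists, "Cafés" is reduced to "caf" and "libraries" to "librar". Neither is a dictionary word, which mirrors the summary's point that no effort is made to guarantee a real root form.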

The inventors are Wayne Loofbourrow and David Casseres. The graphic below is a schematic diagram of a computer system in which an information retrieval system can be employed for different purposes.

[Image: schematic diagram of a computer system in which an information retrieval system can be employed]

 
All contents are Copyright 1984-2010 by Xplain Corporation. All rights reserved. Theme designed by Icreon.