Information Retrieval:
The process of finding documents of an unstructured nature (usually converted to text) that satisfy an information need, from within a large collection.
Examples:
- Email search
- Desktop search
- Legal search
- Knowledge-base search
- Web search (Google, Bing, etc.)
Information retrieval can run over a static or a dynamic collection.
Normal flow for a static collection:
1. Feed the collection of documents to the search engine.
2. Frame a query for the information need.
3. Analyze the results.
4. Refine the query (adding or removing terms, fixing operators) if the results are not as expected.
5. Search with the refined query; continue until you get what you want.
How do we compare two information retrieval engines?
Precision: the fraction of retrieved documents that are relevant to the user's information need.
If we get 10 documents as results and only 2 are relevant to the user, the precision is 2/10.
Recall: the fraction of relevant documents in the collection that are retrieved.
If the collection actually contains 20 relevant documents and the query retrieves 10 documents of which only 2 are relevant, then the recall is 2/20.
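As a minimal sketch (the document ids below are made up to match the numbers above), precision and recall fall out of simple set operations:

# Hypothetical ids: 10 retrieved documents, 20 relevant in the collection.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
relevant = set(range(11, 29)) | {3, 7}    # 20 relevant docs, 2 of them retrieved

hits = retrieved & relevant
precision = len(hits) / len(retrieved)    # 2/10 = 0.2
recall = len(hits) / len(relevant)        # 2/20 = 0.1
print(precision, recall)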
The following things affect precision and recall:
1. Query formulation
2. The underlying search engine logic used while indexing and searching
Techniques
Linear scan / exhaustive search
Search all documents line by line for the query, say using grep on Linux.
This approach is time consuming and may need multiple scans.
It cannot process complex queries like proximity search or ranked results.
It is time and memory inefficient.
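A minimal grep-like scan in Python (the file paths are hypothetical) makes the cost obvious: every query re-reads every line of every document.

def linear_scan(paths, term):
    """Scan every line of every file for the term; O(total text) per query."""
    matches = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if term in line:
                    matches.append((path, line_no))
    return matches

# Hypothetical usage:
# print(linear_scan(["doc1.txt", "doc2.txt"], "brutus"))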
The core technique is to build an inverted index.
Document-Term Incidence Matrix
Example:
We have, say, 4 documents containing 3 distinct terms between them.
----------------------------------------------------------
 term       doc1   doc2   doc3   doc4
----------------------------------------------------------
 antony       0      1      0      1
 brutus       1      1      1      0
 caesar       1      1      1      1
----------------------------------------------------------
A 1 means the term is in the document, while a 0 means the term is not present.
The rows are the distinct terms across all documents.
It is not feasible to build this matrix for even, say, 1 million documents.
For a large collection this matrix is very sparse.
Instead of storing an entry for every document, including the zeros for unmatched documents, we should store only the document ids (postings) of the matched documents. This is the inverted index.
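A minimal sketch of building such an inverted index over a toy in-memory collection (whitespace tokenization only; a real engine would do proper text analysis):

from collections import defaultdict

docs = {                       # toy collection: doc id -> text
    1: "antony and brutus",
    2: "brutus killed caesar",
    3: "caesar and antony",
}

def build_inverted_index(docs):
    index = defaultdict(list)            # term -> sorted list of doc ids
    for doc_id in sorted(docs):          # visiting ids in order keeps
        for term in set(docs[doc_id].lower().split()):   # postings sorted
            index[term].append(doc_id)
    return index

index = build_inverted_index(docs)
print(index["brutus"])   # [1, 2]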
Query Processing
Logical Operators like AND
Example
-------------------------------------------------
| term   | postings (already sorted) |
-------------------------------------------------
| brutus | 1, 4, 6, 8, 10            |
| caesar | 3, 6, 8, 11, 12           |
-------------------------------------------------
Say your query is "brutus AND caesar".
Then you need to merge (intersect) the postings lists of the two terms.
The result will be 6, 8.
You can keep each postings list sorted and intersect them with a merge-sort-like subroutine.
For queries with many terms you need an N-way merge, loading batches of postings into memory one by one as needed.
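A minimal sketch of the two-pointer merge for AND, using the postings from the table above; it runs in O(m + n) for lists of lengths m and n:

def intersect(p1, p2):
    """Intersect two sorted postings lists by walking them in lockstep."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer at the smaller doc id
        else:
            j += 1
    return result

print(intersect([1, 4, 6, 8, 10], [3, 6, 8, 11, 12]))  # [6, 8]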
Phrase search
Say you want to search for "San Francisco". Then we need to find entries that have "San" and "Francisco" placed next to each other, in that order.
One technique is the biword index: index every two consecutive words as one term, e.g. "San Francisco". But what will you do for three-word and four-word phrases?
The biword index is not the standard solution for phrase search.
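A minimal sketch of biword tokenization, the step that turns consecutive word pairs into single index terms:

def biwords(text):
    """Return every pair of consecutive words as one index term."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(biwords("I live in San Francisco"))
# ['i live', 'live in', 'in san', 'san francisco']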
The standard approach is to store positions for each posting in the postings list.
Say we have:
-------------------------------------------
| term      | postings (already sorted) |
-------------------------------------------
| San       | 1, 4, 6, 8, 10            |
| Francisco | 4, 8, 22                  |
-------------------------------------------
Now, for each posting, we also keep the offsets at which the term occurs.
For the word "San" we have:
-------------------------------
| posting | offsets        |
-------------------------------
| 1       | 7, 8, 10, 100  |
| 4       | 1              |
| 6       | 4, 8           |
| 8       | 7, 9, 10       |
| 10      | 11, 12         |
-------------------------------
for word "Francisco" we have
-------------------------------------------------------
| posting | offsets |
-------------------------------------------------------
| 4 | 2.6,9 |
| 8 | 15 ,60 |
| 22 | 1,2,3 |
-------------------------------------------------------
For the phrase search "San Francisco" the result is posting 4, because there "San" is at offset 1 and "Francisco" is at offset 2, so the two terms are next to one another.
Now, while merging postings, you have to compare offsets as well.
Because of this the merge becomes a complex, non-trivial problem.
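A minimal sketch of that positional merge for a two-word phrase, reusing the toy data above: for each document containing both terms, check whether some offset of the second term is exactly one past an offset of the first.

# term -> {posting (doc id) -> sorted offsets}, as in the tables above
positional = {
    "san": {1: [7, 8, 10, 100], 4: [1], 6: [4, 8], 8: [7, 9, 10], 10: [11, 12]},
    "francisco": {4: [2, 6, 9], 8: [15, 60], 22: [1, 2, 3]},
}

def phrase_docs(index, first, second):
    """Postings where `second` occurs immediately after `first`."""
    common = index[first].keys() & index[second].keys()
    result = []
    for doc_id in sorted(common):
        second_offsets = set(index[second][doc_id])
        if any(pos + 1 in second_offsets for pos in index[first][doc_id]):
            result.append(doc_id)
    return result

print(phrase_docs(positional, "san", "francisco"))  # [4]

Replacing the exact `pos + 1` test with a window check (say, a gap of at most k positions) turns the same merge into proximity search.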
With a positional index you can also do proximity search, i.e. allow some number of words between the query words.
Positional indexes are costly: they increase the index size by approximately 30% and waste about 25% of search time.
So we have to use mixed techniques: for common phrases, use biwords. For example, index "tale of two cities" as one term and also as biwords.
For rare words we can use the positional index.
A mix of strategies helps, and the mix can be driven by search statistics: if some phrase is searched again and again, why use the positional index at all? Just index the whole phrase as a single term.
Boolean search vs ranked search
All of the discussion above is about Boolean search, which answers yes/no questions. But what about relevancy? We want the most relevant documents first, then the less relevant ones.
With Boolean search we might get too few results with ANDs and too many with ORs, which is feast or famine.
The art of forming the perfect query lies with the user, so the user needs to be an expert, and we do not want to increase the user's learning curve.
Ranked search comes to the rescue: the user always gets the top relevant (highest-scored) documents first.
Ranked retrieval is essentially search results sorted by some kind of relevancy score, generally between 0 and 1.
Ranked retrieval is generally used for free-text search, but for well-defined fields with a definite set of values, Boolean search is better.
Ranked Search - Scoring Models
Jaccard Coefficient
Commonly used to measure the extent of overlap of two sets A and B:
Jaccard(A,B) = |A ∩ B| / |A ∪ B|
Jaccard(A,A) = |A ∩ A| / |A ∪ A| = |A| / |A| = 1
If A ∩ B = ∅, then Jaccard(A,B) = 0.
A and B may not have the same size, which skews the measure: smaller sets are favored over larger ones.
Example
Query Q = "ides of march"
Document D1 = "caesar died in march"
Document D2 = "the long march"
Document D3 = "the ides march"
Jaccard(Q,D1) = |Q ∩ D1| / |Q ∪ D1| = 1/6 ≈ 0.16
Jaccard(Q,D2) = |Q ∩ D2| / |Q ∪ D2| = 1/5 = 0.2
Jaccard(Q,D3) = |Q ∩ D3| / |Q ∪ D3| = 2/4 = 0.5
Observe that the document with the smaller number of words wins.
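A minimal sketch reproducing the numbers above (plain whitespace tokenization, no normalization):

def jaccard(a, b):
    """Jaccard coefficient of the word sets of two texts."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

q = "ides of march"
print(jaccard(q, "caesar died in march"))  # 1/6 ≈ 0.16
print(jaccard(q, "the long march"))        # 1/5 = 0.2
print(jaccard(q, "the ides march"))        # 2/4 = 0.5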
Problems with Jaccard scoring:
1. It does not consider term frequency; generally a document where a query term occurs often, and where that term is rare in the rest of the collection, is more relevant.
2. It needs normalization for document length, because shorter documents always win over longer ones.
Term frequency weighting
We can build the document-term incidence matrix with actual frequency counts instead of 0/1 values.
----------------------------------------------------------
 term       doc1   doc2   doc3   doc4
----------------------------------------------------------
 antony       0     10      0     12
 brutus      11    100     50     11
 caesar       0      0      1   1000
----------------------------------------------------------
But if two documents contain the same words in different arrangements, both documents get the same score.
Example:
Document D1 = "Jane is quicker than John"
Document D2 = "John is quicker than Jane"
D1 and D2 have the same frequency vector.
For phrase search we still need positional information.
Term Frequency (tf(t,d))
tf(t,d) = the number of times term t occurs in document d.
A document where the query terms occur more frequently is more relevant than one where they have low frequency. But that is not always the case: if a word is common across the whole collection, its importance should be reduced. So we need to dampen globally common words but reward locally frequent ones.
We can dampen the raw count by taking a log of tf(t,d):
w(t,d) = 1 + log(tf(t,d))   if tf(t,d) > 0
w(t,d) = 0                  if tf(t,d) = 0
The score is the sum of w(t,d) over all query terms that appear in the document:
Score(q,d) = ∑_(t ∈ q ∩ d) w(t,d)
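A minimal sketch of this score (base-10 log, matching the idf table below; the example texts are made up):

import math
from collections import Counter

def log_tf_score(query, doc):
    """Sum of 1 + log10(tf) over the query terms present in the document."""
    tf = Counter(doc.lower().split())
    return sum(1 + math.log10(tf[t])
               for t in query.lower().split() if tf[t] > 0)

print(log_tf_score("brutus caesar", "brutus killed caesar caesar caesar"))
# (1 + log10(1)) + (1 + log10(3)) ≈ 2.477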
Inverse Document Frequency
If a document contains query terms that are rare across the collection but frequent within that document, its score should be higher, because rare terms are generally more informative than frequent ones.
For example, very frequent words like "hello", "hi", and "all" should be scored low.
df(t) = the number of documents that contain the term t. Repeated occurrences within the same document are not counted again.
df(t) <= N
N = the total number of documents in the collection
idf = inverse document frequency:
idf(t) = log(N / df(t))
The log function is used to dampen the value.
idf(t) ranges from 0 to log(N).
Example (log is base 10):
N = 1000000 = 1 million
---------------------------------------------------------------
 term          df(t)       idf(t)
---------------------------------------------------------------
 calpurnia           1     log(1000000/1) = 6
 animal            100     4
 sunday           1000     3
 fly             10000     2
 the           1000000     0
---------------------------------------------------------------
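A minimal check reproducing the table values (base-10 log):

import math

N = 1000000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                 ("fly", 10000), ("the", 1000000)]:
    print(term, math.log10(N / df))   # 6.0, 4.0, 3.0, 2.0, 0.0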
tf-idf weighting
w(t,d) = (1 + log(tf(t,d))) * log(N / df(t))
tf-idf is the term frequency weight multiplied by the inverse document frequency.
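A minimal sketch of the combined weight (the counts in the usage line are hypothetical):

import math

def tf_idf(tf, df, N):
    """(1 + log10(tf)) * log10(N / df); zero when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# Hypothetical: term occurs 10 times in the doc and in 100 of 1,000,000 docs.
print(tf_idf(tf=10, df=100, N=1000000))   # (1 + 1) * 4 = 8.0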
Vector Space Model
Terms are the axes, while documents are points/vectors of weights in that space.
This gives an N-dimensional vector space, with one dimension per distinct term.
Queries and documents are vectors in the same space; we look for document vectors that are closest to the query vector, i.e. the documents most similar to the query.
Say we have three documents whose terms are drawn from "gossip" and "jealous":
d1 is mostly about gossip
d3 is mostly about jealous
d2 is about both
If the query weights the two words equally, then the document covering both is the most relevant.
Here d2 is the most relevant document, but measuring Euclidean distance is not a good idea (a long document can be far from a short query even when they cover the same topics); instead we measure the angle between the unit vectors.
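A minimal sketch of cosine similarity over the two axes (gossip, jealous); the weights are made up to match the story above:

import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

q  = (1.0, 1.0)    # query weights both terms equally
d1 = (3.0, 0.2)    # mostly gossip
d2 = (6.0, 6.3)    # a long document about both
d3 = (0.1, 3.0)    # mostly jealous

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine(q, d), 3))
# d2 wins (~1.0) even though it is the farthest from q in Euclidean distance,
# because cosine measures angle, not length.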