Information Retrieval:
The process of finding documents of an unstructured nature (usually converted to text) that satisfy an information need, from within a large collection.
Examples:
- Email search
- Desktop search
- Legal search
- Knowledge-base search
- Web search (Google, Bing, etc.)
Information retrieval can run over a static or a dynamic collection.
Normal flow for a static collection:
1. Feed the collection of documents to the search engine.
2. Frame a query for the information need.
3. Analyze the results.
4. Refine the query (adding or removing terms, fixing operators) if the results are not as expected.
5. Search with the refined query; continue until you get what you want.
How do we compare two information retrieval engines?
Precision: the fraction of retrieved documents that are relevant to the user's information need.
If we get 10 documents as results and only 2 are relevant to the user, the precision is 2/10.
Recall: the fraction of relevant documents in the collection that are retrieved.
If the collection actually contains 20 relevant documents and the query retrieves 10 documents of which only 2 are relevant, then the recall is 2/20.
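As a minimal sketch (the document ids below are made up to match the numbers above), precision and recall fall out of simple set operations:

# Hypothetical ids: 10 retrieved documents, 20 relevant in the collection.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
relevant = set(range(11, 29)) | {3, 7}    # 20 relevant docs, 2 of them retrieved

hits = retrieved & relevant
precision = len(hits) / len(retrieved)    # 2/10 = 0.2
recall = len(hits) / len(relevant)        # 2/20 = 0.1
print(precision, recall)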
The following things affect precision and recall:
1. Query formulation
2. The underlying search engine logic used while indexing and searching
Techniques
Linear scan / exhaustive search
Search all documents line by line for the query, say using grep on Linux.
This approach is time consuming and may need multiple scans.
It cannot process complex queries like proximity search or ranked results.
It is time and memory inefficient.
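A minimal grep-like scan in Python (the file paths are hypothetical) makes the cost obvious: every query re-reads every line of every document.

def linear_scan(paths, term):
    """Scan every line of every file for the term; O(total text) per query."""
    matches = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if term in line:
                    matches.append((path, line_no))
    return matches

# Hypothetical usage:
# print(linear_scan(["doc1.txt", "doc2.txt"], "brutus"))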
The core technique is to build an inverted index.
Document-Term Incidence Matrix
Example:
We have, say, 4 documents containing 3 distinct terms between them.
----------------------------------------------------------
 term       doc1   doc2   doc3   doc4
----------------------------------------------------------
 antony       0      1      0      1
 brutus       1      1      1      0
 caesar       1      1      1      1
----------------------------------------------------------
A 1 means the term is in the document, while a 0 means the term is not present.
The rows are the distinct terms across all documents.
It is not feasible to build this matrix for even, say, 1 million documents.
For a large collection this matrix is very sparse.
Instead of storing an entry for every document, including the zeros for unmatched documents, we should store only the document ids (postings) of the matched documents. This is the inverted index.
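A minimal sketch of building such an inverted index over a toy in-memory collection (whitespace tokenization only; a real engine would do proper text analysis):

from collections import defaultdict

docs = {                       # toy collection: doc id -> text
    1: "antony and brutus",
    2: "brutus killed caesar",
    3: "caesar and antony",
}

def build_inverted_index(docs):
    index = defaultdict(list)            # term -> sorted list of doc ids
    for doc_id in sorted(docs):          # visiting ids in order keeps
        for term in set(docs[doc_id].lower().split()):   # postings sorted
            index[term].append(doc_id)
    return index

index = build_inverted_index(docs)
print(index["brutus"])   # [1, 2]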
Query Processing
Logical Operators like AND
Example
-------------------------------------------------
| term   | postings (already sorted) |
-------------------------------------------------
| brutus | 1, 4, 6, 8, 10            |
| caesar | 3, 6, 8, 11, 12           |
-------------------------------------------------
Say your query is "brutus AND caesar".
Then you need to merge (intersect) the postings lists of the two terms.
The result will be 6, 8.
You can keep each postings list sorted and intersect them with a merge-sort-like subroutine.
For queries with many terms you need an N-way merge, loading batches of postings into memory one by one as needed.
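A minimal sketch of the two-pointer merge for AND, using the postings from the table above; it runs in O(m + n) for lists of lengths m and n:

def intersect(p1, p2):
    """Intersect two sorted postings lists by walking them in lockstep."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer at the smaller doc id
        else:
            j += 1
    return result

print(intersect([1, 4, 6, 8, 10], [3, 6, 8, 11, 12]))  # [6, 8]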
Phrase search
Say you want to search for "San Francisco". Then we need to find entries that have "San" and "Francisco" placed next to each other, in that order.
One technique is the biword index: index every two consecutive words as one term, e.g. "San Francisco". But what will you do for three-word and four-word phrases?
The biword index is not the standard solution for phrase search.
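A minimal sketch of biword tokenization, the step that turns consecutive word pairs into single index terms:

def biwords(text):
    """Return every pair of consecutive words as one index term."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(biwords("I live in San Francisco"))
# ['i live', 'live in', 'in san', 'san francisco']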
The standard approach is to store positions for each posting in the postings list.
Say we have:
-------------------------------------------
| term      | postings (already sorted) |
-------------------------------------------
| San       | 1, 4, 6, 8, 10            |
| Francisco | 4, 8, 22                  |
-------------------------------------------
Now, for each posting, we also keep the offsets at which the term occurs.
For the word "San" we have:
-------------------------------
| posting | offsets        |
-------------------------------
| 1       | 7, 8, 10, 100  |
| 4       | 1              |
| 6       | 4, 8           |
| 8       | 7, 9, 10       |
| 10      | 11, 12         |
-------------------------------
for word "Francisco" we have
-------------------------------------------------------
| posting | offsets |
-------------------------------------------------------
| 4 | 2.6,9 |
| 8 | 15 ,60 |
| 22 | 1,2,3 |
-------------------------------------------------------
For the phrase search "San Francisco" the result is posting 4, because there "San" is at offset 1 and "Francisco" is at offset 2, so the two terms are next to one another.
Now, while merging postings, you have to compare offsets as well.
Because of this the merge becomes a complex, non-trivial problem.
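A minimal sketch of that positional merge for a two-word phrase, reusing the toy data above: for each document containing both terms, check whether some offset of the second term is exactly one past an offset of the first.

# term -> {posting (doc id) -> sorted offsets}, as in the tables above
positional = {
    "san": {1: [7, 8, 10, 100], 4: [1], 6: [4, 8], 8: [7, 9, 10], 10: [11, 12]},
    "francisco": {4: [2, 6, 9], 8: [15, 60], 22: [1, 2, 3]},
}

def phrase_docs(index, first, second):
    """Postings where `second` occurs immediately after `first`."""
    common = index[first].keys() & index[second].keys()
    result = []
    for doc_id in sorted(common):
        second_offsets = set(index[second][doc_id])
        if any(pos + 1 in second_offsets for pos in index[first][doc_id]):
            result.append(doc_id)
    return result

print(phrase_docs(positional, "san", "francisco"))  # [4]

Replacing the exact `pos + 1` test with a window check (say, a gap of at most k positions) turns the same merge into proximity search.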
With a positional index you can also do proximity search, i.e. allow some number of words between the query words.
Positional indexes are costly: they increase the index size by approximately 30% and waste about 25% of search time.
So we have to use mixed techniques: for common phrases, use biwords. For example, index "tale of two cities" as one term and also as biwords.
For rare words we can use the positional index.
A mix of strategies helps, and the mix can be driven by search statistics: if some phrase is searched again and again, why use the positional index at all? Just index the whole phrase as a single term.
Boolean search vs ranked search
All of the discussion above is about Boolean search, which answers yes/no questions. But what about relevancy? We want the most relevant documents first, then the less relevant ones.
With Boolean search we might get too few results with ANDs and too many with ORs, which is feast or famine.
The art of forming the perfect query lies with the user, so the user needs to be an expert, and we do not want to increase the user's learning curve.
Ranked search comes to the rescue: the user always gets the top relevant (highest-scored) documents first.
Ranked retrieval is essentially search results sorted by some kind of relevancy score, generally between 0 and 1.
Ranked retrieval is generally used for free-text search, but for well-defined fields with a definite set of values, Boolean search is better.
Ranked Search - Scoring Models
Jaccard Coefficient
Commonly used to measure the extent of overlap of two sets A and B:
Jaccard(A,B) = |A ∩ B| / |A ∪ B|
Jaccard(A,A) = |A ∩ A| / |A ∪ A| = |A| / |A| = 1
If A ∩ B = ∅, then Jaccard(A,B) = 0.
A and B may not have the same size, which skews the measure: smaller sets are favored over larger ones.
Example
Query Q = "ides of march"
Document D1 = "caesar died in march"
Document D2 = "the long march"
Document D3 = "the ides march"
Jaccard(Q,D1) = |Q ∩ D1| / |Q ∪ D1| = 1/6 ≈ 0.16
Jaccard(Q,D2) = |Q ∩ D2| / |Q ∪ D2| = 1/5 = 0.2
Jaccard(Q,D3) = |Q ∩ D3| / |Q ∪ D3| = 2/4 = 0.5
Observe that the document with the smaller number of words wins.
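A minimal sketch reproducing the numbers above (plain whitespace tokenization, no normalization):

def jaccard(a, b):
    """Jaccard coefficient of the word sets of two texts."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

q = "ides of march"
print(jaccard(q, "caesar died in march"))  # 1/6 ≈ 0.16
print(jaccard(q, "the long march"))        # 1/5 = 0.2
print(jaccard(q, "the ides march"))        # 2/4 = 0.5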
Problems with Jaccard scoring:
1. It does not consider term frequency; generally a document where a query term occurs often, and where that term is rare in the rest of the collection, is more relevant.
2. It needs normalization for document length, because shorter documents always win over longer ones.
Term frequency weighting
We can build the document-term incidence matrix with actual frequency counts instead of 0/1 values.
----------------------------------------------------------
 term       doc1   doc2   doc3   doc4
----------------------------------------------------------
 antony       0     10      0     12
 brutus      11    100     50     11
 caesar       0      0      1   1000
----------------------------------------------------------
But if two documents contain the same words in different arrangements, both documents get the same score.
Example:
Document D1 = "Jane is quicker than John"
Document D2 = "John is quicker than Jane"
D1 and D2 have the same frequency vector.
For phrase search we still need positional information.
Term Frequency (tf(t,d))
tf(t,d) = the number of times term t occurs in document d.
A document where the query terms occur more frequently is more relevant than one where they have low frequency. But that is not always the case: if a word is common across the whole collection, its importance should be reduced. So we need to dampen globally common words but reward locally frequent ones.
We can dampen the raw count by taking a log of tf(t,d):
w(t,d) = 1 + log(tf(t,d))   if tf(t,d) > 0
w(t,d) = 0                  if tf(t,d) = 0
The score is the sum of w(t,d) over all query terms that appear in the document:
Score(q,d) = ∑_(t ∈ q ∩ d) w(t,d)
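A minimal sketch of this score (base-10 log, matching the idf table below; the example texts are made up):

import math
from collections import Counter

def log_tf_score(query, doc):
    """Sum of 1 + log10(tf) over the query terms present in the document."""
    tf = Counter(doc.lower().split())
    return sum(1 + math.log10(tf[t])
               for t in query.lower().split() if tf[t] > 0)

print(log_tf_score("brutus caesar", "brutus killed caesar caesar caesar"))
# (1 + log10(1)) + (1 + log10(3)) ≈ 2.477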
Inverse Document Frequency
If a document contains query terms that are rare across the collection but frequent within that document, its score should be higher, because rare terms are generally more informative than frequent ones.
For example, very frequent words like "hello", "hi", and "all" should be scored low.
df(t) = the number of documents that contain the term t. Repeated occurrences within the same document are not counted again.
df(t) <= N
N = the total number of documents in the collection
idf = inverse document frequency:
idf(t) = log(N / df(t))
The log function is used to dampen the value.
idf(t) ranges from 0 to log(N).
Example (log is base 10):
N = 1000000 = 1 million
---------------------------------------------------------------
 term          df(t)       idf(t)
---------------------------------------------------------------
 calpurnia           1     log(1000000/1) = 6
 animal            100     4
 sunday           1000     3
 fly             10000     2
 the           1000000     0
---------------------------------------------------------------
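A minimal check reproducing the table values (base-10 log):

import math

N = 1000000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                 ("fly", 10000), ("the", 1000000)]:
    print(term, math.log10(N / df))   # 6.0, 4.0, 3.0, 2.0, 0.0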
tf-idf weighting
w(t,d) = (1 + log(tf(t,d))) * log(N / df(t))
tf-idf is the term frequency weight multiplied by the inverse document frequency.
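A minimal sketch of the combined weight (the counts in the usage line are hypothetical):

import math

def tf_idf(tf, df, N):
    """(1 + log10(tf)) * log10(N / df); zero when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# Hypothetical: term occurs 10 times in the doc and in 100 of 1,000,000 docs.
print(tf_idf(tf=10, df=100, N=1000000))   # (1 + 1) * 4 = 8.0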
Vector Space Model
Terms are the axes, while documents are points/vectors of weights in that space.
This gives an N-dimensional vector space, with one dimension per distinct term.
Queries and documents are vectors in the same space; we look for document vectors that are closest to the query vector, i.e. the documents most similar to the query.
Say we have three documents whose terms are drawn from "gossip" and "jealous":
d1 is mostly about gossip
d3 is mostly about jealous
d2 is about both
If the query weights the two words equally, then the document covering both is the most relevant.
Here d2 is the most relevant document, but measuring Euclidean distance is not a good idea (a long document can be far from a short query even when they cover the same topics); instead we measure the angle between the unit vectors.
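A minimal sketch of cosine similarity over the two axes (gossip, jealous); the weights are made up to match the story above:

import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

q  = (1.0, 1.0)    # query weights both terms equally
d1 = (3.0, 0.2)    # mostly gossip
d2 = (6.0, 6.3)    # a long document about both
d3 = (0.1, 3.0)    # mostly jealous

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine(q, d), 3))
# d2 wins (~1.0) even though it is the farthest from q in Euclidean distance,
# because cosine measures angle, not length.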