Friday, October 28, 2016

Information Retrieval - Notes



Information Retrieval:

The process of finding documents of an unstructured nature (usually text, or converted to text) that satisfy an information need, from within a large collection.

Example:
  • Email search
  • Desktop search
  • Legal search
  • Knowledge-base search
  • Web search - Google/Bing etc.

Unstructured data is growing far faster than structured data.

Information retrieval can be done over a static or a dynamic collection.

Normal flow for a static collection

1. Feed the collection of documents to the search engine.
2. Frame a query for the information need.
3. Analyze the results.
4. Refine the query (add or remove terms, fix operators) if the results are not as expected.
5. Search again with the refined query; repeat until you get what you want.

How to compare two information retrieval engines?

Precision: The fraction of retrieved documents that are relevant to the user's information need.
If we get 10 documents as results and only 2 are relevant to the user, the precision is 2/10.

Recall: The fraction of relevant documents in the collection that are retrieved.
If the collection actually contains 20 relevant documents and the query retrieves 10 documents of which only 2 are relevant, the recall is 2/20.
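As a quick sanity check, here is a minimal Python sketch of these two formulas (the document ids are made up for illustration):

    retrieved = set(range(1, 11))                # 10 results returned
    relevant = {1, 2} | set(range(100, 118))     # 20 truly relevant docs, 2 of them retrieved

    precision = len(retrieved & relevant) / len(retrieved)  # 2/10 = 0.2
    recall = len(retrieved & relevant) / len(relevant)      # 2/20 = 0.1
    print(precision, recall)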


The following factors can affect precision and recall:

1. Query formulation
2. The underlying search engine logic used for indexing and searching.


Techniques 


Linear Scan / Exhaustive search

Scan every document line by line for the query, say using grep on Linux.
This approach is time consuming and may need multiple scans.
It cannot handle complex queries such as proximity search or ranked results, and it is inefficient in both time and memory.

The core technique of information retrieval is instead to build an inverted index.

Document-Term incidence matrix

Example :

Say we have 4 documents and consider 3 of the distinct terms occurring in them.

----------------------------------------------------------
 term          doc1    doc2    doc3    doc4
----------------------------------------------------------
 antony         0       1       0       1
 brutus         1       1       1       0
 caesar         1       1       1       1
----------------------------------------------------------

A 1 means the term is present in the document; a 0 means it is not.

The rows range over all distinct terms across the documents.



It is not feasible to build this matrix for even, say, 1 million documents.

For a large collection the matrix is extremely sparse.
So instead of also storing entries for unmatched documents, we should store, for each term, only the ids (postings) of the documents that contain it.


Query Processing 

Logical Operators like AND

Example

-------------------------------------------------------
 term          postings (already sorted)
-------------------------------------------------------
 brutus        1, 4, 6, 8, 10
 caesar        3, 6, 8, 11, 12
-------------------------------------------------------
                       

Say your query is "brutus AND caesar".

Then you need to intersect the postings lists of the two terms.
The result will be 6, 8.

You keep each postings list sorted and intersect them using a merge step, as in the merge-sort subroutine.
For large lists you can do an N-way merge in memory, loading batches of postings one by one as needed.
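A minimal sketch of this two-pointer merge in Python (the intersect helper is illustrative, not from any particular engine):

    def intersect(p1, p2):
        # Intersect two sorted postings lists in O(len(p1) + len(p2)).
        result, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                result.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return result

    # "brutus AND caesar" with the postings above:
    print(intersect([1, 4, 6, 8, 10], [3, 6, 8, 11, 12]))  # [6, 8]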

Phrase search 

Say you want to search for "San Francisco". Then we need to find documents in which "San" and "Francisco" appear next to each other, in that order.

One technique is the biword index: index each pair of consecutive words as a single term, e.g. "San Francisco". But what will you do for three-word and four-word phrases?

A biword index is therefore not the standard solution for phrase search.

Instead, we store the positions of each occurrence alongside each posting in the postings list.

Say we have:

-------------------------------------------------------
 term          postings (already sorted)
-------------------------------------------------------
 San           1, 4, 6, 8, 10
 Francisco     4, 8, 22
-------------------------------------------------------

Now we keep a positional index for each posting.

For the word "San" we have:

-------------------------------------------------------
 posting       offsets
-------------------------------------------------------
 1             7, 8, 10, 100
 4             1
 6             4, 8
 8             7, 9, 10
 10            11, 12
-------------------------------------------------------

For the word "Francisco" we have:

-------------------------------------------------------
 posting       offsets
-------------------------------------------------------
 4             2, 6, 9
 8             15, 60
 22            1, 2, 3
-------------------------------------------------------

For the phrase query "San Francisco" the result is posting 4: there "San" occurs at offset 1 and "Francisco" at offset 2, so the two terms appear next to each other in order.

While merging postings you now have to compare offsets as well, which turns the merge into a more complex, non-trivial problem.
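Here is a small Python sketch of that positional merge for the adjacency case, assuming each term's postings are stored as a dict of doc id -> sorted offsets (the phrase_intersect helper is illustrative only):

    def phrase_intersect(pos1, pos2):
        # Keep documents where some offset of term 2 immediately
        # follows an offset of term 1.
        result = []
        for doc_id in sorted(pos1.keys() & pos2.keys()):
            offsets2 = set(pos2[doc_id])
            if any(off + 1 in offsets2 for off in pos1[doc_id]):
                result.append(doc_id)
        return result

    san = {1: [7, 8, 10, 100], 4: [1], 6: [4, 8], 8: [7, 9, 10], 10: [11, 12]}
    francisco = {4: [2, 6, 9], 8: [15, 60], 22: [1, 2, 3]}
    print(phrase_intersect(san, francisco))  # [4]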

The same positional information also supports proximity search: matching documents where the query words occur within a given number of words of each other.

Positional indexes are costly: they increase index size by roughly 30% and add around 25% to search time.
So mixed techniques are used. For common phrases, use biwords: for example, index "tale of two cities" both as a single term and as biwords.
For rare words, use the positional index.
A mix of strategies helps, and it can be driven by search statistics: if some phrase is searched again and again, why pay for positional merging? Just index the whole phrase as a single term.

Boolean search vs ranked search

All of the discussion above concerns Boolean search, which answers yes/no questions. But what about relevancy? We want the most relevant documents first, then the less relevant ones.

With Boolean search we may get too few results with ANDs and too many with ORs; it is feast or famine.

So the art of forming the perfect query lies with the user, who needs to be something of an expert, and we do not want to steepen the user's learning curve.
Ranked search comes to the rescue: the user always gets the top relevant (highest-scored) documents first.
Ranked retrieval is essentially search results sorted by some kind of relevancy score, generally between 0 and 1.

Ranked retrieval is generally used for free-text search; for well-defined fields with a definite set of values, Boolean search is better.

Ranked Search - Scoring Models

Jaccard Coefficient

Commonly used to measure the extent of overlap of two sets A and B.

Jaccard(A,B) = |A ∩ B| / |A ∪ B|
Jaccard(A,A) = |A ∩ A| / |A ∪ A| = |A| / |A| = 1

If A ∩ B = ∅ then Jaccard(A,B) = 0.

A and B need not have the same size, so the measure can be skewed by the larger set.

Example

Query Q = "ides of march"
Document D1 = "caesar died in march"
Document D2 = "the long march"
Document D3 = "the ides march"

Jaccard(Q,D1) = |Q ∩ D1| / |Q ∪ D1| = 1/6 ≈ 0.17

Jaccard(Q,D2) = |Q ∩ D2| / |Q ∪ D2| = 1/5 = 0.2

Jaccard(Q,D3) = |Q ∩ D3| / |Q ∪ D3| = 2/4 = 0.5

Observe that the document with fewer words wins.
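The computation is a one-liner over term sets; a quick Python check of the three scores above:

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    q = "ides of march".split()
    print(jaccard(q, "caesar died in march".split()))  # 1/6 ≈ 0.17
    print(jaccard(q, "the long march".split()))        # 1/5 = 0.2
    print(jaccard(q, "the ides march".split()))        # 2/4 = 0.5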

Jaccard's scoring problems

1. It does not consider term frequency. Generally, a document that contains a query word many times, especially a word that is rare in the rest of the collection, is more relevant.

2. It needs normalization for document length, because a shorter document always beats a longer one.


Term frequency weighting

We can build the document-term matrix with actual term counts instead of 0/1 values.

----------------------------------------------------------
 term          doc1    doc2    doc3    doc4
----------------------------------------------------------
 antony         0       10      0       12
 brutus         11      100     50      11
 caesar         0       0       1       1000
----------------------------------------------------------

But if two documents contain the same words in a different order, both will have the same frequency vector and hence the same score.
Example :

Document D1 = Jane is quicker than John
Document D2 = John is quicker than Jane

Documents D1 and D2 have the same frequency vector.

So we still need positional information for phrase search.


Term Frequency tf(t,d)

tf(t,d) = the number of times term t occurs in document d.

A document in which the query terms are more frequent is more relevant than one where they are rare. But that is not always the case: if a word is very common across the whole collection, its importance should be reduced. So we want to dampen globally frequent words but reward locally frequent ones.

We can dampen the raw count by taking its log:

w(t,d) = 1 + log(tf(t,d))   if tf(t,d) > 0
w(t,d) = 0                  if tf(t,d) = 0

The score is the sum of w(t,d) over all query terms present in the document:

Score(q,d) = Σ_{t ∈ q ∩ d} w(t,d)
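A short Python sketch of this weighting (base-10 logs assumed, matching the idf table below):

    import math

    def w(tf):
        # Log-damped term frequency weight.
        return 1 + math.log10(tf) if tf > 0 else 0

    def score(query_terms, doc_tf):
        # Sum the weights of the query terms present in the document.
        return sum(w(doc_tf.get(t, 0)) for t in query_terms)

    print(score(["brutus", "caesar"], {"brutus": 100, "caesar": 1}))  # (1+2) + (1+0) = 4.0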

Inverse Document Frequency

If a document contains query terms that are rare across the collection but frequent within that particular document, its score should be higher, because rare terms are generally more informative than frequent ones.
For example, frequent words like "hello", "hi", "all" should be scored low.

df(t) = the number of documents that contain term t. Repeated occurrences within the same document are not counted.

df(t) <= N

N = total number of documents in the collection

idf(t) = inverse document frequency of term t:

idf(t) = log(N / df(t))

The log function is again used to dampen the value.
idf ranges from 0 to log N.

Example

N = 1000000 = 1 million

---------------------------------------------------------------------------------
 term          df(t)          idf(t) = log10(N/df(t))
---------------------------------------------------------------------------------
 calpurnia     1              log10(1000000/1) = 6
 animal        100            4
 sunday        1000           3
 fly           10000          2
 the           1000000        0
---------------------------------------------------------------------------------


tf-idf weighting

w(t,d) = (1 + log(tf(t,d))) * log(N / df(t))

The tf-idf weight is the term frequency weight multiplied by the inverse document frequency.
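A sketch that reproduces the idf table above and combines it with the log tf weight (base-10 logs assumed):

    import math

    N = 1000000  # collection size from the example above

    def idf(df):
        return math.log10(N / df)

    def tf_idf(tf, df):
        # (1 + log10(tf)) * log10(N/df); 0 if the term is absent.
        return (1 + math.log10(tf)) * idf(df) if tf > 0 else 0

    for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                     ("fly", 10000), ("the", 1000000)]:
        print(term, idf(df))  # 6, 4, 3, 2, 0 -- matches the table

    print(tf_idf(3, 100))  # ≈ 5.9 for a term occurring 3 times with df = 100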


Vector Space Model

Terms are axes, while documents (and queries) can be seen as points/vectors of weights in that space.
This gives an N-dimensional vector space, where N is the number of distinct terms.

Queries and documents are vectors in the same space; we need to find the document vectors closest to the query vector, i.e. the documents most similar to the query.

(Figure: documents d1, d2 and d3 plotted on the two term axes, gossip and jealous.)

Say we have three documents whose terms include "gossip" and "jealous":
d1 is mostly about gossip
d3 is mostly about jealous
d2 is about both

The query gives the two words equal weight, so a document that balances both should be more relevant.

Here document d2 is the most relevant, but measuring plain Euclidean distance is a poor way to see that, because vector lengths differ; we need to measure the angle between the unit vectors.

The smaller the angle, the more similar the document is to the query, so we take the cosine of the angle between the unit vectors.


Cosine similarity is the dot product of the unit vectors:

cos(q,d) = q · d / (|q| |d|) = (q/|q|) · (d/|d|)

where q_i is the tf-idf weight of term i in the query q, and d_i is the tf-idf weight of term i in document d.
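A Python sketch of cosine scoring over sparse term -> weight vectors; the gossip/jealous weights below are made-up numbers for the three documents in the figure:

    import math

    def cosine(q, d):
        # Dot product of the two vectors divided by the product of their lengths.
        dot = sum(w * d.get(t, 0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot / (norm_q * norm_d)

    q = {"gossip": 1.0, "jealous": 1.0}
    docs = {"d1": {"gossip": 3.0, "jealous": 0.5},
            "d2": {"gossip": 2.0, "jealous": 2.0},
            "d3": {"gossip": 0.5, "jealous": 3.0}}
    for name, d in docs.items():
        print(name, round(cosine(q, d), 3))  # d2 scores highest (1.0)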

Ranked retrieval in IR

We have term frequency and inverse document frequency; we combine them into tf-idf weights, which form the components of the query and document vectors. We then normalize the query and document vectors and compute the cosine of the angle between them, which is the score for ranked retrieval. The IR system returns documents sorted by this score.

Different systems may use formulas other than the ones discussed here for term frequency, inverse document frequency and normalization.

Saturday, October 15, 2016

LinkedList top interview questions

Here is a selection of interview questions on linked lists.

Scope


  • Singly linked list 
  • Doubly linked list 
  • Unrolled linked list 
  • Skip List 
  • Linked Array List 
  • Memory efficient linked lists



Lateral Thinking points -

  1. Compare linked lists with arrays in terms of time and space. Which one suits which situation?
  2. Recursive vs iterative solutions
  3. Linear/binary searching
  4. Indexing using a hash table/map
  5. Two-runner (fast/slow pointer) technique (see the sketch after this list)
  6. Brute force using nested loops
  7. Time vs space trade-offs
  8. Number of scans of the linked list
  9. Cost of storing next and/or prev pointers
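A minimal Python sketch of the two-runner technique from point 5, here used to find the middle node (the node values are illustrative):

    class Node:
        def __init__(self, data, next=None):
            self.data = data
            self.next = next

    def middle(head):
        # Slow moves one step, fast moves two; when fast hits the end,
        # slow is at the middle of the list.
        slow = fast = head
        while fast and fast.next:
            slow = slow.next
            fast = fast.next.next
        return slow

    head = Node(1, Node(2, Node(3, Node(4, Node(5)))))
    print(middle(head).data)  # 3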


Easy


  1. Implement a linked list with operations add, delete and put by index position; also implement addFirst, addLast, deleteFirst, deleteLast.
  2. Implement a stack using a linked list.
  3. Implement a queue using a linked list.
  4. Reverse a linked list.
  5. Traverse a circular linked list.
  6. Remove duplicates from a linked list.



Medium


  1. Find the kth node from the end.
  2. Find the middle of a linked list.
  3. Find cycles in a linked list and find the entry point of the loop.
  4. Insert a node into a sorted linked list.
  5. Find the merge point of two linked lists of different lengths.
  6. Merge two sorted linked lists.
  7. Implement the Josephus circle using a linked list.
  8. Delete the middle node of a linked list when you have access to that node only.


Advanced


  1. Create a linked list with O(1) access time.
  2. Reverse a linked list in blocks of k nodes.
  3. Reverse each pair of nodes in a linked list.
  4. Split a circular linked list.
  5. Sort a linked list using insertion sort.
  6. Sort a linked list using merge sort.
  7. Segregate a linked list (of integers) into two lists of even and odd numbers.
  8. Add/subtract/multiply/divide two linked lists that represent numbers.
  9. Partition a linked list around a pivot value.
  10. Check whether a linked list is a palindrome.




Tough

  1. Implement CRUD operations on a skip list.
  2. Implement CRUD operations on an unrolled linked list.
  3. Implement CRUD operations on a linked-array list.
  4. Implement an LRU/MRU cache.