Saturday, November 5, 2016

Composite versus Decorator versus Proxy

From GOF

Composite and Decorator have similar structure diagrams, reflecting the fact that both rely on recursive composition to organize an open-ended number of objects. This commonality might tempt you to think of a decorator object as a degenerate composite, but that misses the point of the Decorator pattern. The similarity ends at recursive composition, again because of differing intents.

Decorator is designed to let you add responsibilities to objects without sub-classing. It avoids the explosion of sub-classes that can arise from trying to cover every combination of responsibilities statically. Composite has a different intent. It focuses on structuring classes so that many related objects can be treated uniformly, and multiple objects can be treated as one. Its focus is not on embellishment but on representation.

These intents are distinct but complementary. Consequently, the Composite and Decorator patterns are often used in concert. Both lead to the kind of design in which you can build applications just by plugging objects together without defining any new classes. There will be an abstract class with some sub-classes that are composites, some that are decorators, and some that implement the fundamental building blocks of the system. In this case, both composites and decorators will have a common interface. From the point of view of the Decorator pattern, a composite is a Concrete Component. From the point of view of the Composite pattern, a decorator is a Leaf. Of course, they don't have to be used together and, as we have seen, their intents are quite different.

Another pattern with a structure similar to Decorator's is Proxy. Both patterns describe how to provide a level of indirection to an object, and the implementations of both the proxy and decorator object keep a reference to another object to which they forward requests. Once again, however, they are intended for different purposes.

Like Decorator, the Proxy pattern composes an object and provides an identical interface to clients. Unlike Decorator, the Proxy pattern is not concerned with attaching or detaching properties dynamically, and it's not designed for recursive composition. Its intent is to provide a stand-in for a subject when it's inconvenient or undesirable to access the subject directly because, for example, it lives on a remote machine, has restricted access, or is persistent.

In the Proxy pattern, the subject defines the key functionality, and the proxy provides (or refuses) access to it. In Decorator, the component provides only part of the functionality, and one or more decorators furnish the rest. Decorator addresses the situation where an object's total functionality can't be determined at compile time, at least not conveniently. That open-endedness makes recursive composition an essential part of Decorator. That isn't the case in Proxy, because Proxy focuses on one relationship—between the proxy and its subject—and that relationship can be expressed statically.

These differences are significant because they capture solutions to specific recurring problems in object-oriented design. But that doesn't mean these patterns can't be combined. You might envision a proxy-decorator that adds functionality to a proxy, or a decorator-proxy that embellishes a remote object. Although such hybrids might be useful (we don't have real examples handy), they are divisible into patterns that are useful.


Code Examples

Composite
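A minimal Java sketch of Composite (the names Shape, Circle, and CompositeShape are illustrative, not from any library): the composite holds children behind the same interface as the leaves, so one object and a whole tree of objects can be treated uniformly.

import java.util.ArrayList;
import java.util.List;

// Common component interface: clients treat leaves and composites uniformly.
interface Shape {
    void draw();
}

// Leaf: a fundamental building block.
class Circle implements Shape {
    public void draw() { System.out.println("Drawing a circle"); }
}

// Composite: holds children behind the same interface and forwards to them.
class CompositeShape implements Shape {
    private final List<Shape> children = new ArrayList<>();
    public void add(Shape child) { children.add(child); }
    public void draw() {
        for (Shape child : children) {
            child.draw();  // recursive composition: a child may itself be a composite
        }
    }
}

public class CompositeDemo {
    public static void main(String[] args) {
        CompositeShape drawing = new CompositeShape();
        drawing.add(new Circle());
        CompositeShape group = new CompositeShape();
        group.add(new Circle());
        drawing.add(group);   // composites nest to any depth
        drawing.draw();       // one call renders the whole tree
    }
}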


Decorator
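A minimal Java sketch of Decorator (Message, PlainMessage, and the decorator names are illustrative): responsibilities are added by wrapping, not by sub-classing, and decorators stack in any combination at run time.

// Component interface shared by the plain object and its decorators.
interface Message {
    String getText();
}

// Concrete component: the fundamental building block.
class PlainMessage implements Message {
    public String getText() { return "hello"; }
}

// Decorator base: wraps another Message and forwards to it.
abstract class MessageDecorator implements Message {
    protected final Message inner;
    MessageDecorator(Message inner) { this.inner = inner; }
}

// Concrete decorators add responsibilities without sub-classing PlainMessage.
class UpperCaseDecorator extends MessageDecorator {
    UpperCaseDecorator(Message inner) { super(inner); }
    public String getText() { return inner.getText().toUpperCase(); }
}

class ExclaimDecorator extends MessageDecorator {
    ExclaimDecorator(Message inner) { super(inner); }
    public String getText() { return inner.getText() + "!"; }
}

public class DecoratorDemo {
    public static void main(String[] args) {
        // Decorators compose recursively, in any order, at run time.
        Message m = new ExclaimDecorator(new UpperCaseDecorator(new PlainMessage()));
        System.out.println(m.getText());  // prints: HELLO!
    }
}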



Proxy
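A minimal Java sketch of a virtual proxy (Image, RealImage, and ImageProxy are illustrative names): the proxy stands in for the subject and controls access to it, here by deferring an expensive creation.

// Subject interface shared by the real object and its proxy.
interface Image {
    void display();
}

// Real subject: expensive to create (e.g., loads a large file).
class RealImage implements Image {
    private final String fileName;
    RealImage(String fileName) {
        this.fileName = fileName;
        System.out.println("Loading " + fileName);  // costly work
    }
    public void display() { System.out.println("Displaying " + fileName); }
}

// Virtual proxy: keeps a reference to the subject and forwards requests to it.
class ImageProxy implements Image {
    private final String fileName;
    private RealImage real;  // created lazily, only when actually needed
    ImageProxy(String fileName) { this.fileName = fileName; }
    public void display() {
        if (real == null) {
            real = new RealImage(fileName);
        }
        real.display();  // forward the request to the real subject
    }
}

public class ProxyDemo {
    public static void main(String[] args) {
        Image image = new ImageProxy("photo.png");
        image.display();  // first call loads, then displays
        image.display();  // subsequent calls reuse the loaded image
    }
}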



Adapter versus Bridge design pattern


From GOF

The Adapter and Bridge patterns have some common attributes. Both promote flexibility by providing a level of indirection to another object. Both involve forwarding requests to this object from an interface other than its own.

The key difference between these patterns lies in their intents. Adapter focuses on resolving incompatibilities between two existing interfaces. It doesn't focus on how those interfaces are implemented, nor does it consider how they might evolve independently. It's a way of making two independently designed classes work together without re-implementing one or the other. Bridge, on the other hand, bridges an abstraction and its (potentially numerous) implementations. It provides a stable interface to clients even as it lets you vary the classes that implement it. It also accommodates new implementations as the system evolves.

As a result of these differences, Adapter and Bridge are often used at different points in the software life-cycle. An adapter often becomes necessary when you discover that two incompatible classes should work together, generally to avoid replicating code. The coupling is unforeseen. In contrast, the user of a bridge understands up-front that an abstraction must have several implementations, and both may evolve independently. The Adapter pattern makes things work after they're designed; Bridge makes them work before they are. That doesn't mean Adapter is somehow inferior to Bridge; each pattern merely addresses a different problem.

You might think of a facade as an adapter to a set of other objects. But that interpretation overlooks the fact that a facade defines a new interface, whereas an adapter reuses an old interface. Remember that an adapter makes two existing interfaces work together as opposed to defining an entirely new one.

Please refer to the examples below.

Adapter
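A minimal Java sketch of Adapter (Temperature and FahrenheitSensor are made-up names for illustration): an existing, incompatible class is wrapped so it satisfies the interface the client expects.

// Target interface the client expects.
interface Temperature {
    double getCelsius();
}

// Adaptee: an existing, incompatible class we cannot change.
class FahrenheitSensor {
    double readFahrenheit() { return 98.6; }
}

// Adapter: converts the adaptee's interface into the target interface.
class FahrenheitAdapter implements Temperature {
    private final FahrenheitSensor sensor;
    FahrenheitAdapter(FahrenheitSensor sensor) { this.sensor = sensor; }
    public double getCelsius() {
        return (sensor.readFahrenheit() - 32) * 5.0 / 9.0;
    }
}

public class AdapterDemo {
    public static void main(String[] args) {
        Temperature t = new FahrenheitAdapter(new FahrenheitSensor());
        System.out.println(t.getCelsius());  // prints 37.0
    }
}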


Bridge
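A minimal Java sketch of Bridge (Notification and MessageSender are made-up names for illustration): the abstraction holds a reference to the implementor, so the two hierarchies vary independently and the implementation is picked at run time.

// Implementor hierarchy: how a message gets sent.
interface MessageSender {
    void send(String text);
}

class EmailSender implements MessageSender {
    public void send(String text) { System.out.println("Email: " + text); }
}

class SmsSender implements MessageSender {
    public void send(String text) { System.out.println("SMS: " + text); }
}

// Abstraction hierarchy: what kind of notification it is.
// It holds a reference to the implementor; both sides evolve independently.
abstract class Notification {
    protected final MessageSender sender;
    Notification(MessageSender sender) { this.sender = sender; }
    abstract void notifyUser(String text);
}

class UrgentNotification extends Notification {
    UrgentNotification(MessageSender sender) { super(sender); }
    void notifyUser(String text) { sender.send("[URGENT] " + text); }
}

public class BridgeDemo {
    public static void main(String[] args) {
        // Swap the implementation without touching the abstraction.
        new UrgentNotification(new SmsSender()).notifyUser("server down");
        new UrgentNotification(new EmailSender()).notifyUser("server down");
    }
}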


Friday, October 28, 2016

Information Retrieval - Notes



Information Retrieval:

The process of finding documents of an unstructured nature (usually converted to text) that satisfy an information need, from within a large collection.

Example :
  • Email search 
  • Desktop search 
  • Legal Search 
  • Search knowledge base
  • Web search - Google/Bing etc.
Unstructured data is growing day by day compared to structured data.

Information retrieval can operate on a static or a dynamic collection.

Normal Flow for static collection

1. Feed the collection of documents to the search engine
2. Frame a query for the information need
3. Analyze the results
4. Refine the query (add or remove terms, add the correct operators) if the results are not as expected
5. Search with the refined query, and continue until you get what you want.

How do you compare two information retrieval engines?

Precision: The fraction of retrieved documents that are relevant to the user's information need.
If we get 10 documents as results and only 2 are relevant to the user, the precision is 2/10.

Recall: The fraction of relevant documents in the collection that are retrieved.
If the collection actually contains 20 relevant documents but the query retrieves 10 documents, of which only 2 are relevant, then the recall is 2/20.


The following can affect precision and recall:

1. Query formulation
2. The underlying search-engine logic for indexing and searching.


Techniques 


Linear Scan / Exhaustive Search

Search all documents line by line for the query, for example using grep on Linux.
This approach is time consuming and might need multiple scans.
It cannot process complex queries such as proximity queries or ranked search results.

It is time and memory inefficient.

The core technique to fix this is to build an inverted index.

Document-Term incidence matrix

Example :

Say we have 4 documents that together contain 4 distinct terms.

----------------------------------------------------------
 term        doc1     doc2     doc3     doc4
----------------------------------------------------------
 antony       0        1        0        1
 brutus       1        1        1        0
 caesar       1        1        1        1
----------------------------------------------------------

1 means the term is present in the document; 0 means it is not.

The rows are the distinct terms across all documents.



It is not feasible to build this matrix for even, say, 1 million documents.

For a large collection this matrix is sparse.
Instead of storing document ids (postings) for unmatched documents, we should store only the document ids (postings) of matched documents.


Query Processing 

Logical Operators like AND

Example

-------------------------------------------------------
| term       | postings (already sorted) |
-------------------------------------------------------
| brutus     | 1, 4, 6, 8, 10            |
| caesar     | 3, 6, 8, 11, 12           |
-------------------------------------------------------
                       

Say your query is "brutus AND caesar".

Then you need to intersect the postings lists of the two terms.
The result will be 6, 8.

You can keep each postings list sorted and intersect them with a merge-sort-style subroutine, as in the sketch below.
For multi-term queries you can do an N-way merge in memory, loading batches of postings one by one as needed.
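A minimal Java sketch of the two-list AND merge (the postings arrays and the intersect method are illustrative):

import java.util.ArrayList;
import java.util.List;

public class PostingsMerge {
    // Intersect two sorted postings lists in O(m + n), like the merge step of merge sort.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {          // document matches both terms
                result.add(a[i]);
                i++; j++;
            } else if (a[i] < b[j]) {
                i++;                     // advance the pointer with the smaller doc id
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] brutus = {1, 4, 6, 8, 10};
        int[] caesar = {3, 6, 8, 11, 12};
        System.out.println(intersect(brutus, caesar));  // prints [6, 8]
    }
}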

Phrase search 

Say you want to search for "San Francisco"; then we need to find entries where "San" and "Francisco" are placed next to each other, in that order.

One technique is the biword index: index every two consecutive words as one term, e.g., "San Francisco". But what will you do for three-word and four-word phrases?

A biword index is not the standard solution for phrase search.

We need to store positions for each posting in the postings list.

Say we have

-------------------------------------------------------
| term       | postings (already sorted) |
-------------------------------------------------------
| San        | 1, 4, 6, 8, 10            |
| Francisco  | 4, 8, 22                  |
-------------------------------------------------------

Now we will have a positional index for each posting.

For the word "San" we have

-------------------------------------------------------
| posting    | offsets         |
-------------------------------------------------------
| 1          | 7, 8, 10, 100   |
| 4          | 1               |
| 6          | 4, 8            |
| 8          | 7, 9, 10        |
| 10         | 11, 12          |
-------------------------------------------------------

For the word "Francisco" we have

-------------------------------------------------------
| posting    | offsets         |
-------------------------------------------------------
| 4          | 2, 6, 9         |
| 8          | 15, 60          |
| 22         | 1, 2, 3         |
-------------------------------------------------------

For the phrase search "San Francisco" we get result posting 4, because in document 4 "San" is at offset 1 and "Francisco" is at offset 2, so the two terms are next to one another.

Now, while merging postings, you also have to consider the offsets.
Because of this the merge becomes a complex and non-trivial problem; a simplified sketch of the adjacency check follows.
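A simplified Java sketch of the adjacency check for one matching document (it assumes the two sorted offset arrays for that document are already in hand; a full phrase query would first intersect the postings lists as above):

public class PhraseCheck {
    // True if some offset of the second term is exactly one past an offset of the first,
    // i.e., the two terms are adjacent in this document.
    static boolean adjacent(int[] firstOffsets, int[] secondOffsets) {
        int i = 0, j = 0;
        while (i < firstOffsets.length && j < secondOffsets.length) {
            int wanted = firstOffsets[i] + 1;
            if (secondOffsets[j] == wanted) {
                return true;
            } else if (secondOffsets[j] < wanted) {
                j++;   // second-term offset too small: advance it
            } else {
                i++;   // second-term offset too large: advance the first term
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document 4 from the tables above: "San" at offset 1, "Francisco" at offset 2.
        System.out.println(adjacent(new int[]{1}, new int[]{2, 6, 9}));  // true
    }
}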

You can also do proximity search, i.e., allow some number of words between the query words.

Positional indexes are costly: they increase the index size by approximately 30% and waste about 25% of search time.
So we have to use mixed techniques: for common phrases use biwords, e.g., index "tale of two cities" both as one term and as biwords.
For rare words we can use the positional index.
So a mix of strategies can help. This can also be driven by search statistics: if some phrase is searched again and again, index the whole phrase as a single term instead of relying on the positional index.

Boolean search vs ranked search

All of the discussion above relates to Boolean search, which can answer yes/no questions. But what about relevancy? We want the most relevant documents first, then the less relevant ones.

With Boolean search we might get too few results with ANDs and too many with ORs, which is feast or famine.

So the art of forming the perfect query really lies with the user; the user needs to be an expert at it, and we do not want to increase the user's learning curve.
Ranked search comes to the rescue, so that the user always gets the top relevant/scored documents first.
Ranked retrieval is essentially search results sorted by some kind of relevancy score, generally between 0 and 1.

Ranked retrieval is generally used for free-text search, but for well-defined fields with a definite set of values Boolean search is better.

Ranked Search - Scoring Models

Jaccard Coefficient

Commonly used to measure the extent of overlap of two sets A and B.

Jaccard(A,B) = |A ∩ B| / |A ∪ B|
Jaccard(A,A) = |A ∩ A| / |A ∪ A| = |A|/|A| = 1

If A ∩ B = ∅ then
Jaccard(A,B) = 0

A and B may not have the same size, which skews the score against the larger set.

Example

Query Q = "ides of march"
Document D1 = "caesar died in march"
Document D2 = "the long march"
Document D3 = "the ides march"


Jaccard(Q,D1) = |Q ∩ D1| / |Q ∪ D1| = 1/6 ≈ 0.17

Jaccard(Q,D2) = |Q ∩ D2| / |Q ∪ D2| = 1/5 = 0.2

Jaccard(Q,D3) = |Q ∩ D3| / |Q ∪ D3| = 2/4 = 0.5


Observe: the document with the smaller number of words wins.
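A minimal Java sketch of Jaccard over word sets, reproducing the numbers above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Jaccard(A, B) = |A ∩ B| / |A ∪ B| over the sets of words.
    static double jaccard(String a, String b) {
        Set<String> setA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> setB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String q = "ides of march";
        System.out.println(jaccard(q, "caesar died in march"));  // 1/6 ≈ 0.17
        System.out.println(jaccard(q, "the long march"));        // 1/5 = 0.2
        System.out.println(jaccard(q, "the ides march"));        // 2/4 = 0.5
    }
}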

Jaccard's scoring problems

1. It does not consider term frequency; generally a document where the query terms occur more often, and where those terms are rare across the rest of the collection, is more relevant.

2. It needs normalization for document length, because otherwise shorter documents always beat longer ones.


Term frequency weighting

We can build the document-term incidence matrix with actual term frequencies instead of 0/1 values.

----------------------------------------------------------
 term        doc1     doc2     doc3     doc4
----------------------------------------------------------
 antony       0        10       0        12
 brutus       11       100      50       11
 caesar       0        0        1        1000
----------------------------------------------------------

But if two documents contain the same words in a different arrangement, both documents will have the same score.
Example :

Document D1 = Jane is quicker than John
Document D2 = John is quicker than Jane

Documents D1 and D2 have the same frequency vector.

We need positional information for phrase search.


Term Frequency (tf-t,d)

tf-t,d = the number of times term t occurs in document d.

So a document where the query terms are more frequent is more relevant than one where they have low frequency. But that is not always the case: if a word is very common across the whole collection, the importance of that word should be reduced. So we need to dampen globally common words but boost locally frequent ones.

We can dampen the value by taking the log of tf-t,d:

w-t,d = 1 + log(tf-t,d)   if tf-t,d > 0
w-t,d = 0                 if tf-t,d = 0


The score is the sum of w-t,d over all query terms that appear in the document:

Score = ∑ t ∈ q ∩ d  w-t,d

Inverse Document Frequency 

If a document contains query terms that are rare across the collection but frequent within that particular document, its score should be higher, because rare terms are generally more informative than frequent ones.
For example, frequent words like hello, hi, all should be scored low.

df-t = the number of documents that contain term t. Repeated occurrences within the same document are not counted.

df-t <= N

N = total number of documents in the collection

idf = inverse document frequency

idf-t = log(N/df-t)

Here the log function is used to dampen the value.
The idf value is between 0 and log N.

Example

N = 1000000 = 1 million

---------------------------------------------------------------------------------
 term            df-t           idf-t
---------------------------------------------------------------------------------
 calpurnia       1              log(1000000/1) = 6
 animal          100            4
 sunday          1000           3
 fly             10000          2
 the             1000000        0
---------------------------------------------------------------------------------


tf-idf weighting

w-t,d = (1 + log(tf-t,d)) * log(N/df-t)


The tf-idf weight is the (log-dampened) term frequency multiplied by the inverse document frequency.
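A minimal Java sketch of this weighting (log base 10, matching the idf table above; the method name tfIdf is illustrative):

public class TfIdfDemo {
    // w(t,d) = (1 + log10(tf)) * log10(N / df); 0 when the term is absent.
    static double tfIdf(int tf, int df, int totalDocs) {
        if (tf == 0 || df == 0) return 0.0;
        return (1 + Math.log10(tf)) * Math.log10((double) totalDocs / df);
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // "calpurnia" occurring 3 times in a doc, present in only 1 doc overall:
        System.out.println(tfIdf(3, 1, n));    // (1 + log10 3) * 6 ≈ 8.86
        // "the" occurring 100 times, but present in every doc: idf = 0, so weight = 0.
        System.out.println(tfIdf(100, n, n));  // 0.0
    }
}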


Vector Space Model

Terms are the axes, and documents are points/vectors with tf-idf weights in this N-dimensional space.

Queries and documents are vectors in the same space; we need to find the document vectors closest to the query vector, i.e., the documents most similar to the query.








Say we have three documents over the terms gossip and jealous:
d1 is mostly about gossip
d3 is mostly about jealous
d2 is about both

If the query gives the two words equal weight, then d2 is the most relevant document.

Measuring Euclidean distance is not a good idea here; we need to measure the angle between the vectors.

The smaller the angle, the more similar the document is to the query, so we take the cosine of the angle between the unit vectors.


Cosine similarity is the dot product of the unit vectors:

cos(q,d) = q·d / (|q| |d|) = (q/|q|) · (d/|d|)

q-i is the tf-idf weight of term i in the query q.
d-i is the tf-idf weight of term i in document d.
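A minimal Java sketch of cosine similarity over weight vectors (the gossip/jealous numbers are made up to mirror the discussion above):

public class CosineDemo {
    // cos(q, d) = (q · d) / (|q| * |d|) over weight vectors in the same term space.
    static double cosine(double[] q, double[] d) {
        double dot = 0, normQ = 0, normD = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            normQ += q[i] * q[i];
            normD += d[i] * d[i];
        }
        if (normQ == 0 || normD == 0) return 0.0;
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }

    public static void main(String[] args) {
        // Axes: [gossip, jealous]; d2 weights both terms, like the query.
        double[] query = {1, 1};
        double[] d1 = {3, 0}, d2 = {2, 2}, d3 = {0, 3};
        System.out.println(cosine(query, d1));  // ≈ 0.707
        System.out.println(cosine(query, d2));  // 1.0: smallest angle wins
        System.out.println(cosine(query, d3));  // ≈ 0.707
    }
}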

Ranked retrieval in IR

We have term frequency and inverse document frequency; we combine them into tf-idf weights, which form the components of each vector. We then normalize the query and document vectors and compute the angle between them, whose cosine is the score for ranked retrieval. The IR system gives you documents sorted by this score.

For term frequency, inverse document frequency, and normalization, people may use formulas other than the ones discussed here.





Saturday, October 15, 2016

LinkedList top interview questions

Here are selected interview questions for linked lists.

Scope


  • Singly linked list 
  • Doubly linked list 
  • Unrolled linked list 
  • Skip List 
  • Linked Array List 
  • Memory efficient linked lists



Lateral Thinking points -

  1. Compare linked lists with arrays in terms of time and space. Which one should you use in which situation?
  2. Recursive or iterative solution
  3. Linear/binary searching
  4. Indexing using a hash table/map
  5. Two-runner technique (see the sketch after this list)
  6. Brute force using nested loops
  7. Time/space trade-offs
  8. Number of scans of the linked list
  9. Cost of storing next and/or prev pointers
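A minimal Java sketch of the two-runner (slow/fast pointer) technique from point 5, here used to find the middle of a singly linked list in one scan:

public class TwoRunnerDemo {
    static class Node {
        int data;
        Node next;
        Node(int data, Node next) { this.data = data; this.next = next; }
    }

    // The fast pointer moves two steps per iteration, the slow pointer one;
    // when fast reaches the end, slow is at the middle. One scan, O(1) space.
    static Node middle(Node head) {
        Node slow = head, fast = head;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
        }
        return slow;
    }

    public static void main(String[] args) {
        Node head = new Node(1, new Node(2, new Node(3, new Node(4, new Node(5, null)))));
        System.out.println(middle(head).data);  // prints 3
    }
}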


Easy


  1. Implement a linked list with operations like add, delete, and put at an index position; also implement addFirst, addLast, deleteFirst, deleteLast.
  2. Implement a stack using a linked list.
  3. Implement a queue using a linked list.
  4. Reverse a linked list.
  5. Traverse a circular linked list.
  6. Remove duplicates from a linked list.



Medium


  1. Find the kth node from the end.
  2. Find the middle of a linked list.
  3. Find cycles in a linked list and find the entry point of the loop.
  4. Insert a node into a sorted linked list.
  5. Find the merge point of two linked lists that have different lengths.
  6. Merge two sorted linked lists.
  7. Implement the Josephus circle using a linked list.
  8. Delete the middle node of a linked list when you have access to the middle element only.


Advanced


  1. Create a linked list with O(1) access time.
  2. Reverse a linked list in blocks of k nodes.
  3. Reverse each pair of a linked list.
  4. Split a circular linked list.
  5. Sort a linked list using insertion sort.
  6. Sort a linked list using merge sort.
  7. Segregate a linked list (whose data are integers) into two lists of even and odd numbers.
  8. Add/subtract/multiply/divide two linked lists that represent numbers.
  9. Partition a linked list around a pivot value.
  10. Check whether a linked list is a palindrome.




Tough

  1. Implement CRUD on skiplist
  2. Implement CRUD on unrolled linked list 
  3. Implement CRUD on Linked-Array List 
  4. Implement LRU/MRU cache


Sunday, September 25, 2016

Inheritance versus Composition

This is a fairly basic topic in OOAD, but I have seen that most programmers do not understand the subtleties.
Mostly they do not want to think hard, and they make inheritance the default choice of implementation for reuse.

They finally end up with a BIG FAT parent class extended by many classes, with lots of if-else conditions, and only god knows what mess is in that class.

In Java, inheritance is achieved by extending a class or implementing interface(s).
Composition is achieved by building a strong HAS-A relationship.

Inheritance (Compile Time Structure)

1. Uses sub-classing, i.e., extending the base class (interface or class).
2. It is called white-box reuse because the subclass knows the internals of the parent class.
3. New functionality is achieved by partially or completely redefining parent functionality.
4. Defined statically at compile time and easy to understand.
5. Inheritance breaks encapsulation, because the subclass has access to the internal state of the parent class. If something changes in the parent class, we need to re-test all the extended classes too; any change in the parent class can ripple through all sub-classes.
6. To use inheritance you should know the internal state and documentation of the parent class, so there is a learning curve.
7. Relationships cannot be redefined at run time.
8. You deal with a small number of classes.

Composition (Run Time Structure)

1. Uses composition of small objects that have well-defined interfaces.
2. It is called black-box reuse because the composing object does not know the internal details of the objects it uses.
3. New functionality is developed by combining the small pieces of functionality of a number of objects.
4. Composition works through well-defined interfaces to the small objects, so as long as an interface stays the same, the composing object does not care about the details behind it.
5. Composition preserves encapsulation, allowing the composed objects and their users to change and evolve independently.
6. To use composition you do not need to know the internal state of the objects; once you know the interfaces there is a very small learning curve, so the code is easy to understand.
7. As needed, you can compose an object out of a large number of small objects to produce the required behavior.
8. You deal with a large number of small classes.


So, if you have N small objects with specific, focused behaviors, you should ideally be able to build 2^N possible compositions.

So remember: favor object composition over class inheritance, for more testable code, a smaller learning curve, more encapsulation, and hence better reuse. A minimal sketch of the idea follows.
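A minimal Java sketch (Engine and Car are illustrative names): the behavior is plugged in through an interface rather than inherited, and can even be swapped at run time.

// Small, focused behaviors behind an interface.
interface Engine {
    String start();
}

class PetrolEngine implements Engine {
    public String start() { return "petrol engine started"; }
}

class ElectricEngine implements Engine {
    public String start() { return "electric engine started"; }
}

// Car HAS-A Engine: the behavior is composed, not inherited,
// and can be swapped at run time.
class Car {
    private Engine engine;
    Car(Engine engine) { this.engine = engine; }
    void setEngine(Engine engine) { this.engine = engine; }
    String drive() { return "driving: " + engine.start(); }
}

public class CompositionDemo {
    public static void main(String[] args) {
        Car car = new Car(new PetrolEngine());
        System.out.println(car.drive());
        car.setEngine(new ElectricEngine());  // no new Car subclass needed
        System.out.println(car.drive());
    }
}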



Saturday, September 24, 2016

Angular JS interview questions


Design patterns in one page


The main OOP concepts are classes, objects, inheritance, polymorphism, abstraction, and encapsulation.
Classes have fixed state and fixed behavior, while objects can have dynamic state and behavior.

You can use either object composition or class inheritance (class inheritance or interface inheritance) while designing classes. Inheritance is tied to a class at compile time, so such behavior is static at run time. This is the main theme of class-scoped design patterns.


You can create a large number of granular objects and use dynamic composition through polymorphism to change behavior at run time as per your need. This is the major theme of object-scoped design patterns.

While designing, always program to an interface and not to an implementation, so that the code is reusable for the client and the owner can plug in any implementation as needed, which the client is unaware of. This is also a major theme in most design patterns.

Main Concepts :

Inheritance
(extending an abstract class, implementing one or many interfaces / abstraction / generalization / realization / IS-A)

Inheritance is fixed at compile time, so it has static behavior.

Composition (association / aggregation / composition / HAS-A)
Composition can be decided at run time, so it has dynamic behavior.


Creational Patterns

Abstract Factory (Kit)
Provide an interface for creating families of related or dependent objects without specifying their
concrete classes.

Builder
Separate the construction of a complex object from its representation so that the same construction
process can create different representations.

Factory Method
Define an interface for creating an object, but let sub-classes decide which class to instantiate. Factory
Method lets a class defer instantiation to sub-classes.

Prototype
Specify the kinds of objects to create using a prototypical instance, and create new objects by
copying this prototype.

Singleton
Ensure a class only has one instance, and provide a global point of access to it.


Structural Patterns 

Adapter
Convert the interface of a class into another interface clients expect. Adapter lets classes work
together that couldn't otherwise because of incompatible interfaces.

Bridge
Decouple an abstraction from its implementation so that the two can vary independently.

Composite
Compose objects into tree structures to represent part-whole hierarchies. Composite lets clients treat
individual objects and compositions of objects uniformly.

Decorator
Attach additional responsibilities to an object dynamically. Decorators provide a flexible alternative
to sub-classing for extending functionality.

Facade
Provide a unified interface to a set of interfaces in a subsystem. Facade defines a higher-level
interface that makes the subsystem easier to use.

Flyweight
Use sharing to support large numbers of fine-grained objects efficiently.

Proxy
Provide a surrogate or placeholder for another object to control access to it.


Behavioral Patterns

Chain of responsibility
Avoid coupling the sender of a request to its receiver by giving more than one object a chance to
handle the request. Chain the receiving objects and pass the request along the chain until an object
handles it.

Command
Encapsulate a request as an object, thereby letting you parameterize clients with different requests,
queue or log requests, and support undo-able operations.

Interpreter
Given a language, define a representation for its grammar along with an interpreter that uses the
representation to interpret sentences in the language.

Iterator
Provide a way to access the elements of an aggregate object sequentially without exposing its
underlying representation.

Mediator
Define an object that encapsulates how a set of objects interact. Mediator promotes loose coupling by
keeping objects from referring to each other explicitly, and it lets you vary their interaction
independently.

Memento
Without violating encapsulation, capture and externalize an object's internal state so that the object
can be restored to this state later.

Observer
Define a one-to-many dependency between objects so that when one object changes state, all its
dependents are notified and updated automatically.

State
Allow an object to alter its behavior when its internal state changes. The object will appear to change
its class.

Strategy
Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets
the algorithm vary independently from clients that use it.

Template Method
Define the skeleton of an algorithm in an operation, deferring some steps to sub-classes. Template
Method lets sub-classes redefine certain steps of an algorithm without changing the algorithm's
structure.

Visitor
Represent an operation to be performed on the elements of an object structure. Visitor lets you define
a new operation without changing the classes of the elements on which it operates.


Non GOF patterns 

Object Pool
MVC pattern
Dependency Injection


Practical Hybrid Patterns :

Factory Method + Prototype +Singleton

If you have to create a very large number of object types via a factory method, you need to pass some type identifier as a parameter to the factory method, and inside the factory method you end up writing a long chain of if-else clauses or a long switch to implement the logic.
To avoid this, you can keep a map or registry from type identifier to object. The sub-classes should be cloneable, have a copy constructor, or be singletons. When a request comes to the factory method with a type identifier, look up the object of that type in the registry, then return either a clone of the object or the singleton instance.
Instead of keeping a registry of objects, you can keep a mapping from type identifier to class; then, using reflection, you can create objects on the fly for the specified types.
The registry could be a map/server or simply a property/JSON/XML file. A sketch of the registry-of-prototypes variant follows.
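A minimal Java sketch of the registry-of-prototypes variant (DocumentFactory and the type names are illustrative):

import java.util.HashMap;
import java.util.Map;

// Product type with a prototype hook.
interface Document {
    Document copy();
    String describe();
}

class Invoice implements Document {
    public Document copy() { return new Invoice(); }
    public String describe() { return "invoice"; }
}

class Report implements Document {
    public Document copy() { return new Report(); }
    public String describe() { return "report"; }
}

// Factory method backed by a registry of prototypes instead of a long if-else/switch chain.
class DocumentFactory {
    private static final Map<String, Document> registry = new HashMap<>();
    static {
        registry.put("invoice", new Invoice());
        registry.put("report", new Report());
    }
    static Document create(String type) {
        Document prototype = registry.get(type);
        if (prototype == null) {
            throw new IllegalArgumentException("unknown type: " + type);
        }
        return prototype.copy();  // clone the prototype; a singleton could be returned as-is
    }
}

public class RegistryFactoryDemo {
    public static void main(String[] args) {
        System.out.println(DocumentFactory.create("invoice").describe());
        System.out.println(DocumentFactory.create("report").describe());
    }
}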



Web application security interview questions


Web Services Interview questions


Distributed Systems interview questions

Cassandra interview questions




Distributed Systems Concepts :

1. What happens when a node goes down and does not respond to the coordinator node?
2. What synchronization is used between Cassandra nodes? NTP or vector clocks?
3. How do you decide the number of nodes in a cluster, given the read and write load requirements?
4. What is the CAP theorem? Is it possible to achieve all three? If not, why?
5. What is eventual consistency?
6. How do you tune read and write consistency in Cassandra? How do you get stronger consistency? If we use the strongest consistency level, how do we sacrifice availability?
7. What kind of hashing is used in Cassandra?
8. What is consistent hashing?
9. Why are writes faster in Cassandra? Describe the write path.
10. Why are reads slower in Cassandra? Describe the read path.
11. Compare the read/write paths in Cassandra and in MySQL.
12. When should you choose Cassandra over MySQL?
13. Which consistency model does Cassandra use: CA, AP, or CP? And why?
14. When some nodes are down, how does Cassandra do replication?
15. What is hinted handoff in Cassandra?
16. What are consistency levels? Are consistency levels per query or global?
17. What are the differences between ALL, QUORUM, and ONE, and what is the effect on performance?
18. If my replication factor is 3, then with consistency level ONE will Cassandra still replicate to 3 nodes?
19. How do you handle OLTP and OLAP load? What could be the strategy?

NOSQL interview questions


Elastic Search interview questions


Friday, September 9, 2016

Object Oriented Analysis and Design : OOAD Interview Questions (Scale , Design Patterns , Algorithms )


Nowadays nobody will ask you to design a system with a limited number of users; the expectation is always that you design a highly available, partition-tolerant, horizontally scaling, secure cloud application.

Gone are the days when the interviewer asked you to design a chat server that works in one JVM process for two users.

Now people need to start thinking laterally about highly scalable, distributed, parallel algorithms, and about scaling just by adding more hardware as the load grows.
This is all the more essential nowadays because the architecture remains the same for a single user or for millions of users; the only difference is the number of machines handling them in the background.


Pre-requisite : 

- Design patterns
- Distributed systems
- Cloud platforms - AWS/OpenStack/Azure
- SQL/NoSQL databases for big data, tuning consistency in NoSQL
- Map-Reduce - Hadoop/Apache Spark/ZooKeeper
- Micro-services architecture
- Rule/workflow engines
- Information retrieval - (Elastic Search/Solr)
- Shared-nothing architecture
- Asynchronous systems (queue-based producer/consumer) - P2P
- Pub-sub models
- Caching (write-through, write-back, write-around, eviction policies) - Memcached, Redis



Design Goals - 


  • Reusable, extensible design
  • High performance (acceptable SLAs, storage)
  • High availability
  • Partition tolerance
  • Handles big data
  • Security
  • Scales linearly (vertically and horizontally) with more hardware.



The general aim is to throw ambiguous questions at the candidate and see how he navigates open-ended scenarios: how he asks questions, how he creates use cases, and how he considers performance, scale, security, and SOLID principles in his design.

Such questions are suitable for face-to-face interviews or for offline tasks with sufficient time. They are not suitable for telephonic interviews.

Approach 

Top to bottom approach 

Macro Level -

  1. Build basic use cases and invariants. Understand success/warning/failure cases. Write down the use cases one by one. Ask questions about scenarios using the Ws (What, Who, When, Where, Why) and also How.
  2. Understand the broad, independent logical modules/packages/micro-services.
  3. Identify the main entities/interfaces/classes in the system. Prefer segregated interfaces; do not create one fat interface.
  4. Understand the basic actors in the system.
  5. Draw component/class/object/package diagrams.
  6. Anticipate changes in future requirements: things that are not dynamic today could become dynamic, similar types of specialized functionality could be added, etc.


Micro-Level 

  1. First do interface-based design for all major classes (DAO classes, coordinator classes).
  2. Consider design patterns where applicable; do not overuse them just to show your knowledge.
  3. Think about a multi-threaded approach for compute-intensive / IO-intensive activities.
  4. Understand the basic tasks and their separable boundaries so that you can process items in a distributed way. If a task is fully independent, distributed computing (Map-Reduce) will be helpful.
  5. Externalize configurable parts, like rules engines and thread-pool sizes, in JSON, property files, XML, etc.




General Questions 

1. Design a coffee machine
2. Design a lift system for a multi-storied building
3. Design a chat server (a typical pub-sub problem)
4. Design a distributed cache
5. Design a traffic signal system
6. Design a bus/train scheduling system
7. Schedule meetings given a fixed number of meeting rooms
8. Design an online shopping site
9. Design a tiny-URL system
10. Do some kind of analytics on text across a large number of files (a typical Map-Reduce problem)
11. Design a restaurant reservation system
12. Design a restaurant system with rule-based pricing and dynamic configurations for menu items
13. Find connected components in large data (say you have a spreadsheet with some kind of graph structure)
14. Design a search engine
15. Design Twitter
16. Design a parking lot system
17. Design a custom re-entrant lock
18. Design a custom thread pool
19. Design an object pool or a JDBC connection pool
20. Design a distributed file system









Monday, August 8, 2016

Competencies of Software Architect


Basic Algorithms and Data Structures 


  • Big O notations 
  • Recursion and backtracking 
  • Linear data structures (linked list , dynamic list ,stack , queue , priority queue )
  • Sorting 
  • Searching (linear , binary , interpolation )
  • Hashing (Hash function , collision resolution  - chaining , open addressing , double hashing/universal hashing )
  • String Algorithms (KMP, Rabin-Karp, Boyer-Moore)
  • Trees (Binary Tree , Binary Search Tree, Binary Heaps )
  • Tree Balancing algorithms (B-trees , AVL , 2-3 , red-black trees)
  • Graphs (Minimum Spanning Tree , Shortest Paths , Cycle detection , dynamic connectivity )


Advanced Algorithms and Data Structures 
  • Greedy Algorithms 
  • Dynamic Programming 
  • Linear Programming 
  • Probabilistic Data Structures 
  • Complexity classes (P,NP etc .)



Distributed Algorithm Concepts
  • 2 phase commit 
  • Paxos - solving consensus 
  • Leader election 
  • Consistent Hashing /Routing 
  • Synchronous/Asynchronous communication 
  • Fault tolerance
  • Eventual Consistency 
  • Replication 
  • pub-sub using distributed queues
  • Bloom Filters 


Distributed/Cloud Computing technologies 

  • Basics of AWS or OpenStack  or Azure or IBM bluemix 
  • No-SQL 
    • Key value store  - Redis , Memcached
    • Column Family - Cassandra 
    • Document - MongoDB , Elastic Search 
    • Graph Databases - Neo4j 
  • Distributed Queues (SQS , RabbitMQ , Kafka)
  • Distributed Locking /co-ordination  - Apache ZooKeeper 
  • Map-Reduce frameworks 
    • Hadoop eco-systems
    • Apache Spark 
  • REST web services (Jersey , apache CXF , Swagger )


REST/SOAP web services 
  1.  Micro-service architecture 
    • what is micro-service 
    • How to design micro-service which persists data.
  2. SOAP web services in general 
  3. Differences between SOAP and REST , why REST ?
  4.  REST Design principles 
    • Idempotent methods (GET, HEAD, PUT) vs non-idempotent ones (POST, PATCH)
    • HATEOAS
    • Security (stateless; authentication and authorization; security frameworks (OAuth, Basic over SSL, HTTPS, Digest); error messages)
    • Caching (cache-control , etag , validation , expiration )
    • HTTP headers used for rest (Media-Type , Cache Control , Auth , Accept , Content-Type, If-Non- Match )
    • Commonly used HTTP codes (1XX, 2XX, 3XX, 4XX, 5XX)
    • Searching , sorting , filtering and paging  (Link headers , rel , href , query parameters )
    • Documentation - Swagger 
    • HTTP 1.1 vs HTTP 2.0
    • Handling Asynchronous request (Notification URL , Push /Pull models ) 
    • URL designing conventions 
    • Versioning of APIs
    • Error Messaging (user message , developer message , http error codes 4XX, 5XX) 
    • JSON (snake_case , vs camelCase , DATE format)
  5. REST frameworks - Jersey , Apache CXF , Spring Framework  , Play framework , REST Easy 

Basic References :
http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-api
RESTful Web Services
http://www.amazon.in/dp/9352130693
http://www.amazon.in/dp/059680168


Search Technologies 
  • Elastic Search 
  • Apache Solr 

Object Oriented Design 

  • Design Patterns (at least design patterns from GOF)
  • SOLID design principles 
  • UML 


Java Technology Stack 

Java Language :
  • Java Language basics (grammar , collections , exception handling )
  • Concurrency 
    •   Java Memory Model 
    •   Pessimistic Concurrency control ( exclusive locks , read/write locks )
    •   Optimistic Concurrency control (CAS based operations )
    •   Java Concurrent Package (Locks , Concurrent collections , Semaphore , Latch , Cyclic Barriers ,  ThreadPools , Executors  etc.)
    • Thread dump analysis 
  • Lambda Expressions 
  • Custom Annotations 
  • Java Serialization Specifications 
  • JVM architecture 
    • Class loaders 
    • Garbage Collection algorithms 
    • Java Security 
  • Regex 



Source Code Management

  •  GIT
  •  Perforce
  •  SVN 


Testing Frameworks 

  • TDD
    • Junit 
    • TestNG
  • BDD
    • Cucumber

Cryptography 

Cryptographic hash functions






Tuesday, July 12, 2016

Binary Tree Traversing algorithms

There can be surprises when people ask you to print a binary tree in some special way.


The following techniques help:

1. Recursion
2. Using a stack/queue for the non-recursive version
3. Keeping track of the (X,Y) coordinates of each node (start the root at (0,0)): when you go left decrease X, when you go right increase X

The following is a possible set of questions about traversing a binary tree (it may not be a BST).

Depth first (with or without recursion; see the iterative sketch after this list):

1. Pre-order
2. In-order
3. Post-order
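A minimal Java sketch of in-order traversal without recursion, using an explicit stack in place of the call stack:

import java.util.ArrayDeque;
import java.util.Deque;

public class InOrderIterative {
    static class Node {
        int data;
        Node left, right;
        Node(int data) { this.data = data; }
    }

    // In-order traversal without recursion: an explicit stack replaces the call stack.
    static void inOrder(Node root) {
        Deque<Node> stack = new ArrayDeque<>();
        Node current = root;
        while (current != null || !stack.isEmpty()) {
            while (current != null) {       // go as far left as possible
                stack.push(current);
                current = current.left;
            }
            current = stack.pop();          // visit the node
            System.out.print(current.data + " ");
            current = current.right;        // then traverse its right subtree
        }
    }

    public static void main(String[] args) {
        Node root = new Node(2);
        root.left = new Node(1);
        root.right = new Node(3);
        inOrder(root);  // prints: 1 2 3
    }
}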

Breadth first:

1. Breadth first - level order
2. Breadth first - zig-zag, i.e., alternating direction per level (also called spiral-order traversal); see the sketch after this list
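A minimal Java sketch of zig-zag (spiral) level-order traversal using a queue, reversing alternate levels:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedList;

public class ZigZagTraversal {
    static class Node {
        int data;
        Node left, right;
        Node(int data) { this.data = data; }
    }

    // Level order with a queue; reverse the direction on alternate levels.
    static void zigZag(Node root) {
        if (root == null) return;
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        boolean leftToRight = true;
        while (!queue.isEmpty()) {
            int levelSize = queue.size();
            LinkedList<Integer> level = new LinkedList<>();
            for (int i = 0; i < levelSize; i++) {
                Node node = queue.poll();
                if (leftToRight) level.addLast(node.data);
                else level.addFirst(node.data);
                if (node.left != null) queue.add(node.left);
                if (node.right != null) queue.add(node.right);
            }
            System.out.println(level);
            leftToRight = !leftToRight;
        }
    }

    public static void main(String[] args) {
        Node root = new Node(1);
        root.left = new Node(2);
        root.right = new Node(3);
        root.left.left = new Node(4);
        zigZag(root);  // prints [1], then [3, 2], then [4]
    }
}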

Special orders

1. Print the nodes, or the sum of the nodes, in each vertical column of the binary tree.
2. Print the sum of the nodes on each diagonal of the binary tree.
3. Print all nodes on the circumference of the tree, i.e., all nodes on the left border, the bottom (leaves), and the right border.
4. Print the bottom view of the binary tree.
5. Print the top view of the binary tree.
6. Print the inner nodes of the binary tree.
7. Print the outer layer of the binary tree.
8. Print only the leaf nodes.
9. Print the periphery of the binary tree.



Other oddities could come up, like:

If you can write pre-order, in-order, and post-order using recursion, always think about how you can do them with a stack, i.e., without recursion.

The same logic, where possible, applies to n-ary trees.



We will solve all of the above one by one in the next blogs.