Sunday 22 July 2012

Fuzzy matching autocomplete library with inbuilt standalone http server in java

Past couple of days I had this idea in mind of implementing a autocomplete that uses fuzzy matching. For fuzzy matching of a partial string with all the strings in dictionary it uses the Levenshtein distance. Now finding the Levenshtein distance of given string with each string in dictionary is very inefficient. So I used the idea described here. Converted it to java and modified it to fit the needs.

It is available in 2 modes. First is the Http server mode in which you run the standalone server and call it directly from your ajax script.It returns data in json format with decreasing scores. For this you need to configure it by providing the file name containing the words for suggestion. To use it run the server and then use this url to get the results   http://server-ip/autoc?word=word_for_autocomplete&tf=max_typos_allowed
Specify the word_for_autocomplete and the max_typos_allowed parameter.

The second is you can embed it directly in your application by first creating trie and then calling the getResult() method in CHandler class. See the source code for better understanding. A sample main method is given which uses the second method.

You can find the source code for the project here.

If you like the idea and want to improve it or port it to other languages fork me github.

Sunday 8 July 2012

Find a word in 2D matrix

I read about this problem in TC forums.

How to efficiently search for a word in a 2D matrix of letters , given the conditions that , we can start at any position and for every position we can go to only the adjacent positions (vertically,horizontally and diagonally ,i.e in all 8 directions).
Example- searching topcoder in this matrix :
t o p t
g f c q
t y o d
d f r e



My solution to this problem is a simple recursive DFS. Along with yes or no it highlights the word itself in the matrix by displaying the word and '-' at all other locations.  My implementation is as shown below.
You can also find it here


import java.util.*;
import java.io.*;
public class Main{
        public static int flag=0,m,n,dx[]={0,0,1,-1,1,-1,-1,1};
        public static int dy[]={1,-1,0,0,1,-1,1,-1};
        public static char wrd[];
        static class S{char a;int p,x,y;
                S(){}
                S(char b,int pos,int c,int d){a=b;p=pos;x=c;y=d;}
        }
        public static LinkedList<S>vis=new LinkedList<S>();
        public static LinkedList<S>init=new LinkedList<S>();
  public static Stack<S>fc=new Stack<S>();
        public static char mat[][];
        public static void main(String args[]) throws Exception{
                BufferedReader br=new BufferedReader(new InputStreamReader(System.in));
                m=Integer.parseInt(br.readLine());
                n=Integer.parseInt(br.readLine());
                int i,j;
    long a=System.currentTimeMillis();
                mat=new char[m][n];
    wrd=br.readLine().toCharArray();
                for(i=0;i<m;i++) mat[i]=br.readLine().toCharArray();
                for(i=0;i<m;i++) for(j=0;j<n;j++) if(mat[i][j]==wrd[0]){init.add(new S(wrd[0],0,i,j));}
                for(i=0;i<init.size();i++) {solve(init.get(i));if(flag==1) break;}
    if(flag==0) {System.out.println("NO");return;}
    for(i=0;i<n;i++) for(j=0;j<n;j++) mat[i][j]='-';
    S tp=null;
    for(;!fc.isEmpty();) {tp=fc.pop();mat[tp.x][tp.y]=tp.a;}
    for(i=0;i<m;i++) {
     for(j=0;j<n;j++) System.out.print(mat[i][j]);
     System.out.println();
    }
    System.out.println((System.currentTimeMillis()-a)+"ms");
        }
        public static void solve(S s){
   int i,j;
   if(s.p==wrd.length-1){System.out.println("YES");fc.push(s);flag=1;return;}
            for(i=0;i<8;i++) 
    if((s.x+dx[i])>0&&(s.y+dy[i])>0&&(s.x+dx[i])<m&&(s.y+dy[i])<n&&mat[s.x+dx[i]][s.y+dy[i]]==wrd[s.p+1]) 
     solve(new S(wrd[s.p+1],(s.p+1),(s.x+dx[i]),(s.y+dy[i])));
            if(s.p>=0&&s.p<wrd.length&&flag==1){fc.push(s);}
        }
}
21
21
topcoder
ataabmtdvegybznbzqwtb
odiurtrdutatureiortzn
godytqoqdxjwvxbnxwaxm
jhhxicpsqzfaxcctcefck
lcjsoideqlcwevkovrfyj
padervwrtjjthnjpbtuth
qacjgkaektdewmhsnyyfg
zbxbigjdgrythhgdmudef
crufjakjerwwsdffliywd
vqzbnkmfjtgrwkdgkodas
naohuyhfgofgfjuhjrvaa
mjgfdskuifyiumyjgleaw
lrepwoiepodsobtbdjdcf
hfueydfughyuicavsfosh
iovhjuhodjitgzzcaxctg
hjgsiugfyiusfgioqzpeb
jbhsfgudfghdhfghgsoqv
fgyhdutsfyirtyiytrtac
gjihysoduftfuihfytadx
gyrwydfiuzadytiudfufs
jgiosjtisodfstotisryt
YES
---------------------
---------------------
---------------------
---------------------
---------------------
---------------------
---------------------
---------------------
---------------------
---------------------
-----------------r---
------------------e--
------------------d--
------------------o--
------------------c--
------------------p--
------------------o--
------------------t--
---------------------
---------------------
---------------------
125ms

Trie datastructure

Trie is a very useful data structure. It finds many uses in string based applications like dictionary, autocomplete, spell check etc.
This blog post is not a tutorial on trie but I am simply sharing my trie implementation in java here.
You could find many other implementations on the internet but I really struggled understanding the other implementations especially the function of getting complete words matching a prefix. So I decided to write my own. Here is my implementation.
You can also find it here

class Node{
	public char d;public boolean e;
	public LinkedList<Node>c;
	public Node(){}
	public Node(char a){d=a;e=Boolean.FALSE;c=new LinkedList<Node>();}
	public Node sub(char a){
		int i;
		if(c.isEmpty()) return null;
		for(i=0;i<c.size();i++)
			if(c.get(i).d==a) return c.get(i);
		return null;
	}
}
class Trie{
	Node root;
	public Trie(){
		root=new Node(' ');
	}
	public void add(String a){
		Node n,cur=root;int len=a.length(),i,j;
		if(len==0) cur.e=Boolean.TRUE;
		else{
			for(i=0;i<len;i++){
				if(cur.sub(a.charAt(i))==null) break;
				cur=cur.sub(a.charAt(i));
			}
			while(i<len){
				n=new Node(a.charAt(i));
				cur.c.add(n);
				cur=cur.sub(a.charAt(i));
				i++;
			}
			cur.e=Boolean.TRUE;
		}
	}
	public boolean search(String a){
		Node cur=root;int i,j,len=a.length();
		for(i=0;i<len;i++) {
			if(cur.sub(a.charAt(i))==null) return Boolean.FALSE;
			cur=cur.sub(a.charAt(i));
		}
		if(i==len) return Boolean.TRUE;
		return Boolean.FALSE;
	}
	public boolean searchw(String a){
		Node cur=root;int i,j,len=a.length();
		for(i=0;i<len;i++) {
			if(cur.sub(a.charAt(i))==null) return Boolean.FALSE;
			cur=cur.sub(a.charAt(i));
		}
		return cur.e;
	}
	public String next(String pre){
		Node cur=root;int i,len=pre.length();
		for(i=0;i<len;i++) {
			if(cur.sub(pre.charAt(i))==null) return "";
			cur=cur.sub(pre.charAt(i));
		}
		String res="";
		if(i==len) {
			for(Node a:cur.c)
				res=res+String.valueOf(a.d);
			return res;
		}
		return "";
	}
	public LinkedList<Node> getnode(String pre){
		Node cur=root;int i,j,len=pre.length();
		for(i=0;i<len;i++) {
			if(cur.sub(pre.charAt(i))==null) return null;
			cur=cur.sub(pre.charAt(i));
		}
		if(i==len) return cur.c;
		return null;
	}
	public Vector<String>all=new Vector<String>();
	String tm;
	public Vector<String> getStrings(String pre){
		rec(pre);
		return all;
		//for(String f:all) System.out.println(f);
		//System.out.println(all.size());
	}
	public void rec(String pr){
		LinkedList<Node>nod=getnode(pr);
		for(Node n:nod){
			tm=pr+String.valueOf(n.d);
			if(n.e) {all.add(tm);}
			else {
				rec(tm);
			}
		}
	}
}

Levenshtein Distance:(How big the social network for a word is?)


Last week I came across this problem on codeeval.com 

Levenshtein Distance 

Description:

Two words are friends if they have a Levenshtein distance of 1 (For details see http://en.wikipedia.org/wiki/Levenshtein_distance). That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word’s social network consists of all of its friends, plus all of their friends, and all of their friends’ friends, and so on. Write a program to tell us how big the social network for the word 'hello' is, using this word list https://raw.github.com/codeeval/Levenshtein-Distance-Challenge/master/input_levenshtein_distance.txt

Input sample:

Your program should accept as its first argument a path to a filename.The input file contains the word list. This list is also available at https://raw.github.com/codeeval/Levenshtein-Distance-Challenge/master/input_levenshtein_distance.txt.

Output sample:

Print out how big the social network for the word 'hello' is. e.g. The social network for the word 'abcde' is 4846.
Now there are many ways of solving this as you might find on the internet. Most of them depend on finding the Levenshtein distance for calculating the neighbors of a word. But the problem I found that was it takes O(n2) time to calculate Levenshtein  distance(edit distance using dp) between 2 words. After this it all gets down to doing a DFS starting with 'hello'. This approach takes a lot of time to run. 
My approach for overcoming the problem with calculating the neighbors is as follows. Now we know that in this we need only the words that are at a distance of 1. Now these words are the once that can be obtained either by deleting a character, or replacing a character. So I simply simulated it using for loops and generating all possible words formed by deleting or adding or changing a letter and look it up in dictionary. For efficient storing of dictionary and lookup I used 2 dimensions. ie. An array of HashSet where HashSet at index i stores only words of length i.
You can find the full program here
Here is the logic for getting all words at distance 1:

public static void addwords(String wrd){
	int i,len=wrd.length();
	char z;
	StringBuffer sb;
	String b="";
	for(i=0;i<len;i++){
		sb=new StringBuffer(wrd);
		sb=sb.deleteCharAt(i);b=sb.toString();
		if(h[len-1].contains(b)){
			st.add(b);h[len-1].remove(b);
		}
	}
	for(i=0;i<=len;i++){
		for(z='a';z<='z';z++){
			sb=new StringBuffer(wrd);
			sb=sb.insert(i,z);
			b=sb.toString();
			if(h[len+1].contains(b)){
				st.add(b);h[len+1].remove(b);
			}
		}
	}
	for(i=0;i<len;i++){
		for(z='a';z<='z';z++){
			sb=new StringBuffer(wrd);
			sb.setCharAt(i,z);
			b=sb.toString();
			if(h[len].contains(b)){
				st.add(b);h[len].remove(b);
			}
		}
	}
}