Question

I'm building a small web search engine to search about 1 million web pages, and I want to know the best way to build the inverted index. Should I use a DBMS, or something else? I'm interested in several aspects: storage cost, performance, indexing speed, and query speed. I don't want to use any open-source project for this; I want to build my own.


Solution

Perhaps you could elaborate on why you do not wish to use F/OSS tools like Lucene or Sphinx.

OTHER TIPS

Most of the current closed-source database managers have some sort of full-text indexing capability. Given HTML's popularity, I'd guess most also have pre-written filters for it, so searching for something like <p> won't give 1000 hits for every web page.

If you want to do the job entirely on your own, filtering the HTML is probably the single hardest part. From there, building an inverted index takes a lot of text processing and produces a large result, but it's basically pretty simple: you scan through all the documents and build a list of words and their locations (usually after filtering out extremely common words like "a", "an", "and", etc., that won't be meaningful search terms), then merge all of those lists together into one big index.
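For what it's worth, here is a minimal sketch of that idea in Python (my own illustration, not tested at the 1-million-page scale): it strips tags with the standard library's HTMLParser, drops a small assumed stop-word list, and records (doc_id, position) postings per word.

```python
# A minimal sketch: strip HTML with the standard library, drop a few common
# stop words, and record word -> (doc_id, position) postings.
# The stop-word list and tokenizer are illustrative assumptions.
from html.parser import HTMLParser
from collections import defaultdict
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}  # assumed sample list

class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags plus script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def build_inverted_index(docs):
    """docs: dict of doc_id -> raw HTML. Returns word -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, html in docs.items():
        words = re.findall(r"[a-z0-9]+", extract_text(html).lower())
        for pos, word in enumerate(words):
            if word not in STOP_WORDS:
                index[word].append((doc_id, pos))
    return index

if __name__ == "__main__":
    docs = {1: "<p>An inverted index maps words to documents.</p>",
            2: "<p>The index is built by scanning all documents.</p>"}
    print(build_inverted_index(docs)["index"])   # -> [(1, 2), (2, 1)]
```

In a real system you'd write the postings to disk in sorted order rather than keep them in a dictionary, but the scan-tokenize-append structure is the same.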

Given the size of the full index, it's often useful to add a second-level index that's small enough that you can be sure it'll easily fit into real memory (e.g. restrict it to a few hundred entries or so). A really small (but somewhat ineffective) version just goes by the first letters of words, so the "A" words start at 0, "B" at 12345, "C" at 34567, and so on. That isn't very effective though -- you get a lot more words that start with "A" than with "X", for example.

It's more effective to build your index, and then pick a few hundred (or whatever) words that are evenly spaced throughout the index. Then use that as your first-level index. In theory, you could get considerably more elaborate, such as something like a B+ tree, but that's usually overkill -- out of a million documents, chances are that you'll end up with fewer than a hundred thousand words that are used often enough to make much difference to the index size. Even at that, quite a few of the entries will be things like typos, not real words...
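To make the two-level idea concrete, here is a small sketch under simplifying assumptions: the full index's vocabulary is a sorted list "on disk", the first level is a few hundred evenly spaced sample words kept in memory, and a binary search over the samples tells you which block of the big index to read. The function names and sample counts are mine, purely for illustration.

```python
# Sketch of a two-level index: an in-memory list of evenly spaced sample words
# plus their offsets into the full sorted index; bisect narrows a query to one block.
import bisect

def build_first_level(sorted_words, n_samples=200):
    """Pick roughly n_samples words evenly spaced through the sorted vocabulary.
    Returns (sample_words, block_starts), where block_starts[i] is the offset
    in the full index at which sample_words[i] begins."""
    step = max(1, len(sorted_words) // n_samples)
    offsets = list(range(0, len(sorted_words), step))
    return [sorted_words[i] for i in offsets], offsets

def locate_block(query, sample_words, block_starts):
    """Return the (start, end) slice of the full index that could hold `query`."""
    i = max(bisect.bisect_right(sample_words, query) - 1, 0)
    start = block_starts[i]
    end = block_starts[i + 1] if i + 1 < len(block_starts) else None
    return start, end   # read and scan only this slice of the on-disk index

# Usage: the vocabulary would come from the full inverted index.
vocab = sorted(["apple", "banana", "cherry", "index", "query", "search", "zebra"])
samples, starts = build_first_level(vocab, n_samples=3)
print(locate_block("query", samples, starts))   # -> (4, 6); vocab[4:6] contains "query"
```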

I think this book has your answer, if you're still looking for it.

http://nlp.stanford.edu/IR-book/information-retrieval-book.html

You may want to start with Hadoop, which will distribute the index-building work effectively over a cluster. You can use any language for it; Java and Python are recommended. Using Hadoop/MapReduce, you can index your web pages fairly easily, but they will need to be cached/stored on disk, and you will need a parser/tokenizer to extract the text first. There are some freely available parsers on the net. You can start from here if you want to do it manually. Once you have an index, storing it is another task.
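To illustrate the general shape of such a job, here is a sketch that simulates the map, shuffle/sort, and reduce phases locally in Python; the function names are mine, not Hadoop API calls. With Hadoop Streaming, the same map and reduce logic would run as separate scripts reading stdin and writing stdout, with Hadoop handling the sort between them.

```python
# MapReduce-shaped index build, simulated locally so the sketch is self-contained.
import re
from itertools import groupby

def map_doc(doc_id, text):
    """Map phase: emit (word, doc_id) pairs for one already-parsed document."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield word, doc_id

def reduce_word(word, doc_ids):
    """Reduce phase: merge all doc ids for one word into a posting list."""
    return word, sorted(set(doc_ids))

def run_locally(docs):
    # Map phase
    pairs = [kv for doc_id, text in docs.items() for kv in map_doc(doc_id, text)]
    # Shuffle/sort phase (Hadoop does this between map and reduce)
    pairs.sort()
    # Reduce phase
    return dict(reduce_word(word, [d for _, d in grp])
                for word, grp in groupby(pairs, key=lambda kv: kv[0]))

print(run_locally({1: "web search engine", 2: "inverted index for web search"}))
# -> {'engine': [1], 'for': [2], 'index': [2], 'inverted': [2],
#     'search': [1, 2], 'web': [1, 2]}
```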

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow