SEO Information

From Corpora to Matching

Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine of the angle between each document vector and the query vector;
10) Find the sought vector column;
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, as opposed to standard words.
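As a minimal sketch of this preparation step, assuming Python and only its standard library: the stop-word list, the `TextExtractor` class and the sample page below are hypothetical and stand in for whatever a real system would use.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects text found outside tags, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html):
    """Remove the markup from a page and return its plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_terms(text):
    """Lower-case the text, split it into words and drop the standard words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


page = "<html><body><h1>The River Wandle</h1><p>Mills on the river.</p></body></html>"
terms = extract_terms(clean_page(page))
print(terms)
```

Stemming (reducing "mills" to "mill", for example) is left out here to keep the sketch short.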

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature per document.

Build term-by-document matrix:
The search space has N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This is what allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term, so each row records the frequency of one term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
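The counting step could look like the following sketch. The three-document toy corpus and the function name `build_term_document_matrix` are made up for illustration; rows are terms (in alphabetical order) and columns are documents.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already reduced to its terms.
documents = [
    ["river", "mill", "river"],
    ["mill", "museum"],
    ["museum", "history", "museum"],
]


def build_term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    vocabulary = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    # A Counter returns 0 for terms a document does not contain.
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


vocabulary, matrix = build_term_document_matrix(documents)
for term, row in zip(vocabulary, matrix):
    print(f"{term:8s} {row}")
```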

Compress the matrix:
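There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, the column (or row) index of each value, and pointers marking where each row (or column) starts.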
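A minimal sketch of Compressed Row Storage, assuming a small dense matrix held as a list of rows. The array names `values`, `col_index` and `row_ptr` are a common convention, not something fixed by the text above.

```python
def to_compressed_row_storage(matrix):
    """Scan the matrix row by row and keep only the non-zero entries."""
    values = []      # the non-zero entries, in row order
    col_index = []   # the column of each stored entry
    row_ptr = [0]    # row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def row_from_crs(values, col_index, row_ptr, i, n_cols):
    """Rebuild dense row i from the three arrays."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_index[k]] = values[k]
    return row


# Toy term-by-document matrix (4 terms, 3 documents); illustrative data only.
matrix = [[0, 0, 1], [1, 1, 0], [0, 1, 2], [2, 0, 0]]
values, col_index, row_ptr = to_compressed_row_storage(matrix)
print(values, col_index, row_ptr)
```

Compressed Column Storage is the same idea applied to the transposed matrix. The saving only becomes noticeable on large, sparse matrices, which term-by-document matrices usually are.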

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
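As a sketch, assuming the Euclidean (L2) length is the one meant by "unit length", each column can be divided by the square root of the sum of its squared entries. The matrix below is illustrative data only.

```python
import math


def normalise_columns(matrix):
    """Scale every column (document) of the matrix to Euclidean length 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [
        math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    # An all-zero column has no direction, so it is left as zeros.
    return [
        [matrix[i][j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy term-by-document counts (4 terms, 3 documents).
matrix = [[0, 0, 1], [1, 1, 0], [0, 1, 2], [2, 0, 0]]
unit = normalise_columns(matrix)
for row in unit:
    print([round(x, 3) for x in row])
```

After this step a long document and a short one with the same proportions of terms end up as the same vector.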

Singular Value Decomposition:
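This factors a matrix into three matrices. When the matrix is symmetric and positive semi-definite, as the product of the term-by-document matrix with its own transpose is, the two outer matrices are identical (one appears transposed) and represent the eigenvectors: the new dimensions. The third is diagonal and represents the eigenvalues, that is, the spread of the corpus along these new dimensions.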

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised so that it expresses the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which the documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.
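To make the geometric picture concrete, here is a sketch that finds only the single most important axis. It uses power iteration on the symmetric matrix A * A^T, which is a simpler technique swapped in for a full Singular Value Decomposition. The matrix A, the function names and the iteration count are assumptions chosen for illustration.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gram_matrix(a):
    """Return A * A^T, the symmetric term-by-term matrix."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in a] for r1 in a]


def dominant_eigenpair(sym, iterations=200):
    """Approximate the largest eigenvalue of sym and its unit eigenvector."""
    v = [1.0] * len(sym)
    for _ in range(iterations):
        w = matvec(sym, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient: v is unit length, so v . (sym v) estimates the eigenvalue.
    eigenvalue = sum(x * y for x, y in zip(v, matvec(sym, v)))
    return eigenvalue, v


# Toy term-by-document counts (4 terms, 3 documents); illustrative data only.
A = [[0, 0, 1], [1, 1, 0], [0, 1, 2], [2, 0, 0]]
G = gram_matrix(A)
eigenvalue, axis = dominant_eigenpair(G)
print(round(eigenvalue, 4), [round(x, 3) for x in axis])
```

The returned `axis` is the direction in term space along which the documents are most spread out, and `eigenvalue` measures that spread (it is the square of the largest singular value of A). A full decomposition would give all the axes at once; in practice a numerical library would be used for that.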

Queries:
Queries must be based on the features/terms defined within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector against the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
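A sketch of the matching step, assuming the query is turned into a 0/1 vector over the same vocabulary and that both the query and the document columns are scaled to unit length, so that each product is the cosine of the angle between them. The vocabulary, matrix and query terms are made-up sample data.

```python
import math


def normalise(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def rank_documents(matrix, query):
    """Return (document index, cosine) pairs, best match first."""
    q = normalise(query)
    scores = []
    for j in range(len(matrix[0])):
        column = normalise([row[j] for row in matrix])
        scores.append(sum(a * b for a, b in zip(q, column)))
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Rows follow the vocabulary order; columns are documents 0, 1 and 2.
vocabulary = ["history", "mill", "museum", "river"]
matrix = [[0, 0, 1], [1, 1, 0], [0, 1, 2], [2, 0, 0]]

query_terms = ["museum", "history"]
query = [1 if term in query_terms else 0 for term in vocabulary]

ranking = rank_documents(matrix, query)
for doc, score in ranking:
    print(doc, round(score, 3))
```

A cosine close to 1 means the document points in nearly the same direction as the query; a cosine of 0 means they share no terms.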

© I am the website administrator of the Wandle industrial museum (http://www.wandle.org). The museum was established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
