What can this search engine do for me that Altavista and Yahoo can't?

Altavista and Yahoo are millions of times more comprehensive and more powerful than this little thing, but their power has a downside too: sometimes they are too good. You will realise this if you try to find what anti-CoS pickets are planned in Atlanta or what the Pilot has said about the bridge. No matter how you limit your search, Altavista will return results for anti-abortion pickets in Atlanta and for airline pilots and bridges over rivers. Yahoo will not, but Yahoo forces you to click your way through a maze of categories and is slow in adding pages to its index. If you are looking for a picket next week, you will always find it three weeks too late on Yahoo.

But this is not all. This search engine can currently also help you find broken links on your pages or on those of others. Just click "Find 404s" from the main menu and see which links you need to fix. You can search for broken links on a particular website or in general. In the future it might be possible to see the referring pages to broken links. Until then, use Altavista. Find the 404s to your pages here, then go to Altavista and search for "link:bad_url" without "http://". That will return pages that link to your non-existing page, so you can mail the webmasters involved.
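
As a minimal sketch of the Altavista step above: the helper below only builds the query string by dropping a leading "http://" and prefixing "link:". The function name and the example URL are made up for illustration; nothing here talks to Altavista.

```python
def altavista_link_query(bad_url: str) -> str:
    """Build the "link:" query described above, without "http://"."""
    prefix = "http://"
    if bad_url.lower().startswith(prefix):
        bad_url = bad_url[len(prefix):]
    return "link:" + bad_url


# Hypothetical broken URL, used only as an example.
print(altavista_link_query("http://www.provider.com/~user/cos/gone.html"))
```

Paste the resulting string into the Altavista search box to list the pages that still link to the missing page.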

And there is more. While indexing, this spider downloads and keeps a local copy of all pages that it indexes. If a site is indexed, then a mirror of it has been made too. To know more about this function, read further down about mirrors.

How do I search?

  • Type your search terms.
  • Choose from the drop-down menu whether you want to search for all terms, any terms or boolean. It is usually better to start searching for all terms and then expand the search to any terms. Default is all.
  • If you choose boolean search, you need to use operators. The search engine accepts the following:
    • & (ampersand) means AND.
    • | (pipe) means OR.
    • &~ (ampersand-tilde) means BUT NOT.
    Use the symbols. The words AND, OR and NOT are not recognised as operators.
    You can group search terms with parentheses. Here are some examples:
    To search for pickets in general except revenge pickets you would type "picket &~ revenge".
    To search for pickets in Clearwater or Los Angeles but not revenge pickets you would type "picket & (clearwater | angeles) &~ revenge"
    To search for any pickets in Clearwater and only critics' pickets in Los Angeles you would type "(picket & clearwater) | (picket & angeles &~ revenge)".
    You get the gist, so just try it.
  • You cannot use wildcards. If you search for "sciento*", you will only find the pages that contain the word "sciento" itself.
  • Hyphenated words in pre-formatted text are treated as two words. Therefore, the word "scientology" will not be indexed in the following example:
    ...faithful to its traditions of harassment tactics, sciento-
    logy sued...
    Instead, the words "sciento" and "logy" will be found in the index.
  • Currently the search engine does not support searching for exact phrases either. Enclosing phrases in quotes is meaningless.
  • Decide which category of sites you want to search in. Default is all, but you can limit your search to critics' sites only or free zone sites only.
  • You can search in the meta-tag page description, in the meta-tag keywords, in the page title, in the page body or in all of the above. Default is all. Pages that contain your search term in the meta-tags or in the URL are ranked higher and appear first in the results.
  • The "results per page" and "output format" options are self-explanatory.
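
To make the boolean operators and the hyphenation rule above more concrete, here is a rough sketch of how such queries could be evaluated against the words of a single page. It is not the code the search engine actually runs. The names `words` and `matches` are invented, the rule that & and &~ bind tighter than | is an assumption, and the splitting on every non-alphanumeric character is a simplification (the text above only describes that behaviour for hyphens in pre-formatted text). Malformed queries are not checked.

```python
import re


def words(text: str) -> set:
    """Split text on anything that is not a letter or digit, lowercased."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query: str, text: str) -> bool:
    """Evaluate a query built from &, | and &~ against the words in text."""
    tokens = re.findall(r"&~|[&|()]|[a-z0-9]+", query.lower())
    page = words(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom() -> bool:
        tok = take()
        if tok == "(":
            value = or_expr()
            take()  # consume the closing ")"
            return value
        return tok in page

    def and_expr() -> bool:
        value = atom()
        while peek() in ("&", "&~"):
            op = take()
            right = atom()
            value = value and (right if op == "&" else not right)
        return value

    def or_expr() -> bool:
        value = and_expr()
        while peek() == "|":
            take()
            right = and_expr()
            value = value or right
        return value

    return or_expr()


print(matches("picket &~ revenge", "Picket in Clearwater next week"))
print(matches("picket & (clearwater | angeles) &~ revenge",
              "Revenge picket in Los Angeles"))
# The hyphenated word is stored as "sciento" and "logy", never "scientology".
print(words("sciento-\nlogy sued"))
```

The first query matches its page, the second does not because the page contains "revenge", and the last line shows why a search for "scientology" misses a page where the word was broken across two lines.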

How do I check if my site has been indexed?

Hit "Check index" on the main menu and fill in an exact URL. If it is indexed it will be returned to you; if not, you will get an empty results page. If your main entry page is indexed, then all pages directly or indirectly linked from it are likely to have been indexed too.

How do I submit a site for indexing?

Hit "submit" on the main menu and give a URL to be indexed. But please read the following section first.

How can I make the indexing of my site easier?

To do this you need to understand how the spider works.

The search engine can be told to index an entire domain, a directory or a single page. As you can imagine, it does not understand what it reads; it just reads words and puts them in a database. However, in order to maintain the relevancy value of the database, no irrelevant information should be allowed to be indexed.

Some sites are completely dedicated to the subject of scientology. Examples are www.xenu.net and www.fza.org. Everything within these sites is relevant to this search engine and indexing these sites is very easy: the spider can just walk them and index all the links in them.

Other sites are more complicated. Some people have part of their site dedicated to scientology and use other parts to publish information about themselves, their work or their hobbies. Ideally the different subjects are grouped in subdirectories and have an index of their own. For example, a site can contain
    www.provider.com/~user/myself/
    www.provider.com/~user/mywork/
    www.provider.com/~user/myhobby/
    www.provider.com/~user/cos/
where the index at www.provider.com/~user/index.html points to individual indices in the subdirectories, such as www.provider.com/~user/cos/index.html and those in turn link to all the pages in that particular subdirectory. In this example, telling the spider to index
    www.provider.com/~user/cos/
would yield the right results.

Unfortunately, not all people are that orderly. All too often, websites are "arranged" in this manner:
    www.provider.com/~user/index.html
    www.provider.com/~user/myself1.html
    www.provider.com/~user/myself2.html
    www.provider.com/~user/myself3.html
    www.provider.com/~user/myhobby1.html
    www.provider.com/~user/myhobby2.html
    www.provider.com/~user/myhobby3.html
    www.provider.com/~user/mywork.html
    www.provider.com/~user/cos1.html
    www.provider.com/~user/cos2.html
    www.provider.com/~user/cos3.html

From the point of view of a subject-specific spider, this is a disaster. Indexing
    www.provider.com/~user/
will spider all the pages in the directory, including those about work and hobby, which lowers the relevancy value of the index. Pointing the spider to
    www.provider.com/~user/cos1.html
doesn't help, because there is usually a link to "home" there that, if followed, will lead to the indexing of all the pages anyway. The only solution here is to manually feed the indexer all the relevant individual pages, i.e.
    www.provider.com/~user/cos1.html
    www.provider.com/~user/cos2.html
    www.provider.com/~user/cos3.html
and tell it not to follow any links at all. This is a lot of work, and it is not fail-safe if the site content changes.
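
The effect of that "home" link can be sketched with a toy crawler. Everything in the block is hypothetical: the link graph is invented to resemble the flat site above, and `crawl` is a simplified stand-in for the real spider, which of course fetches pages over HTTP instead of reading a dictionary.

```python
from collections import deque

BASE = "www.provider.com/~user/"

# Invented link graph for the flat site: page -> pages it links to.
LINKS = {
    BASE + "index.html": [
        BASE + "myself1.html",
        BASE + "myhobby1.html",
        BASE + "mywork.html",
        BASE + "cos1.html",
    ],
    BASE + "cos1.html": [
        BASE + "index.html",  # the "home" link
        BASE + "cos2.html",
    ],
    BASE + "cos2.html": [BASE + "cos3.html"],
}


def crawl(start, scope, follow_links=True):
    """Return the set of pages reached from start without leaving scope."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or not url.startswith(scope):
            continue
        seen.add(url)
        if follow_links:
            queue.extend(LINKS.get(url, []))
    return seen


# Starting at cos1.html still drags in the work and hobby pages.
print(sorted(crawl(BASE + "cos1.html", BASE)))

# Feeding the three pages by hand, with link following switched off.
manual = set()
for page in ("cos1.html", "cos2.html", "cos3.html"):
    manual |= crawl(BASE + page, BASE, follow_links=False)
print(sorted(manual))
```

With a dedicated directory the scope itself would do the filtering: a scope of www.provider.com/~user/cos/ rejects the "home" link because it points outside that prefix.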

The same problem arises when the subject of a site is not scientology, but something else. The scientology pages in the site
    www.provider.com/~user/cults/
need to be separated from those on other cults, so they still need to be indexed manually.

Thus, in order to make life easier for everybody, including yourself, please group all your scientology-related pages in one dedicated directory and put an index in it. Whether you choose to group them further within the base directory or not does not affect the spidering. For example
    www.provider.com/~user/cos/
    www.provider.com/~user/cos/skriptures/
    www.provider.com/~user/cos/pickets/
    www.provider.com/~user/cos/news/
is just as easy to spider as
    www.provider.com/~user/cos/
as long as the /cos/ directory is free of non-scientology-related material. Furthermore, a site like this can be automatically re-indexed every once in a while; new pages will be found and obsolete pages will be removed from the index in a clean and useful manner.

So, if you are about to submit your site for indexing, please take the opportunity to check it first and adjust it where necessary. If your site looks like the bad example above and you don't want to do anything about it, then please submit all the individual pages that you want indexed and repeat the procedure every time you add, delete or change a page. Such pages will always be at the bottom of the todo list of the spider.

So what's that about mirrors?

As mentioned earlier, while the spider downloads a page in order to index it, it also keeps a local copy. This can come in very handy if a site is suddenly taken down for the usual reasons, or if the maintainer of a site happens to accidentally delete it and, as always in these cases, Murphy has taken the backup.

There is one important limitation though: the spider only downloads pages that will be indexed, i.e. text. It keeps copies of all .htm*, .txt, .php* etc. files, but it does not download images, sounds, or any other non-indexable files. This might change in the future, but currently all that is saved is text.
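
A rough illustration of that filter, assuming the decision is made on the file name alone. The three patterns come straight from the paragraph above; since that list ends in "etc", the real spider presumably accepts more types than these, and the function name `is_mirrored` is invented.

```python
from fnmatch import fnmatch

# Patterns named in the text; the real list is longer ("etc").
TEXT_PATTERNS = ("*.htm*", "*.txt", "*.php*")


def is_mirrored(filename: str) -> bool:
    """Guess whether a file would be kept in the mirror, by name only."""
    name = filename.lower()
    return any(fnmatch(name, pattern) for pattern in TEXT_PATTERNS)


for name in ("index.html", "notes.txt", "page.php3", "logo.gif", "speech.wav"):
    print(name, is_mirrored(name))
```

So a mirrored site keeps its text but loses its pictures and sound files.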

For various reasons the mirrors are not publicly available. If you want a copy of a site, you need to mail me.

Swell, but who is mirroring the search engine itself?

Nobody, so it could go down at any time. A search engine has many enemies even if you don't count the CoS: anything from hardware failures to unpaid bills. If you would like to mirror it, you are most welcome. It will run on practically any UNIX variant and all the software you need is free. To create a mirror, install MySQL, then install mnogosearch, and then mail me for a copy of the database and the HTML. If you have some experience you can have it up and running in less than half an hour. After all, you don't need to configure anything; you will get a ready configuration with the mirror.

Zenon Panoussis
Searchmaster