How can I make the indexing of my site easier?
To do this you need to understand how the spider works.
The search engine can be told to index an entire domain,
a directory or a single page. As you can imagine, it
does not understand what it reads; it just reads words
and puts them in a database. However, in order to
maintain the relevancy value of the database, no
irrelevant information should be allowed to be indexed.
Some sites are completely dedicated to the subject of
scientology. Examples are www.xenu.net and www.fza.org.
Everything within these sites is relevant to this
search engine and indexing these sites is very easy:
the spider can just walk them and index every page
it finds in them.
Other sites are more complicated. Some people have
part of their site dedicated to scientology and use
other parts to publish information about themselves,
their work or their hobbies. Ideally the different
subjects are grouped in subdirectories and have
an index of their own. For example, a site can
contain
www.provider.com/~user/myself/
www.provider.com/~user/mywork/
www.provider.com/~user/myhobby/
www.provider.com/~user/cos/
where the index at www.provider.com/~user/index.html
points to individual indices in the subdirectories,
such as www.provider.com/~user/cos/index.html and
those in turn link to all the pages in that particular
subdirectory. In this example, telling the spider to index
www.provider.com/~user/cos/
would yield the right results.
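
To illustrate why the orderly layout is easy to handle, here is a minimal sketch of a directory-scope check. It is not the spider's actual code; the function name in_scope and the http:// scheme on the example URLs are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def in_scope(url: str, base: str) -> bool:
    """Return True if url is on the same host as base and under its directory."""
    u, b = urlsplit(url), urlsplit(base)
    # Reduce the base path to its directory part, always ending in "/".
    base_dir = b.path if b.path.endswith("/") else b.path.rsplit("/", 1)[0] + "/"
    return u.netloc.lower() == b.netloc.lower() and u.path.startswith(base_dir)


# Hypothetical starting point, taken from the example site above.
BASE = "http://www.provider.com/~user/cos/"

for candidate in (
    "http://www.provider.com/~user/cos/index.html",
    "http://www.provider.com/~user/myhobby/index.html",
    "http://www.provider.com/~user/index.html",
):
    print(candidate, in_scope(candidate, BASE))
```

With a check like this, a link from the /cos/ index back to the top-level index is simply skipped, so the hobby and work pages never reach the database.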
Unfortunately, not all people are that orderly. All
too often websites are "arranged" in this manner:
www.provider.com/~user/index.html
www.provider.com/~user/myself1.html
www.provider.com/~user/myself2.html
www.provider.com/~user/myself3.html
www.provider.com/~user/myhobby1.html
www.provider.com/~user/myhobby2.html
www.provider.com/~user/myhobby3.html
www.provider.com/~user/mywork.html
www.provider.com/~user/cos1.html
www.provider.com/~user/cos2.html
www.provider.com/~user/cos3.html
From the point of view of a subject-specific spider,
this is a disaster. Indexing
www.provider.com/~user/
will spider all the pages in the directory, including
those about work and hobby, which lowers the relevancy
value of the index. Pointing the spider to
www.provider.com/~user/cos1.html
doesn't help, because there is usually a link to "home"
there that, if followed, will lead to the indexing of
all the pages anyway. The only solution here is to
manually feed the indexer all the relevant individual
pages, i.e.
www.provider.com/~user/cos1.html
www.provider.com/~user/cos2.html
www.provider.com/~user/cos3.html
and tell it not to follow any links at all. This implies
a lot of work, and the list goes stale as soon as the
site content is changed.
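
The effect of the "home" link can be shown with a small simulation. This is only a sketch: the link graph below is a made-up, shortened version of the flat site above, and crawl is a hypothetical helper, not part of the real spider.

```python
from collections import deque

# Hypothetical link graph for the flat layout: every page links back to "home".
LINKS = {
    "index.html": ["myself1.html", "myhobby1.html", "mywork.html", "cos1.html"],
    "myself1.html": ["index.html"],
    "myhobby1.html": ["index.html"],
    "mywork.html": ["index.html"],
    "cos1.html": ["index.html", "cos2.html"],
    "cos2.html": ["index.html", "cos3.html"],
    "cos3.html": ["index.html"],
}


def crawl(seeds, follow_links=True):
    """Return the set of pages reached from seeds (breadth-first)."""
    seen = set()
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if page in seen:
            continue
        seen.add(page)
        if follow_links:
            queue.extend(LINKS.get(page, []))
    return seen


# Starting at cos1.html and following links drags in the whole site.
print(sorted(crawl(["cos1.html"])))
# Feeding the relevant pages by hand, without following links, stays on topic.
print(sorted(crawl(["cos1.html", "cos2.html", "cos3.html"], follow_links=False)))
```

The second call only ever sees the pages in its seed list, which is exactly why a new cos4.html would go unnoticed until somebody submits it.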
The same problem arises when the subject of a site is
broader than scientology. The scientology pages
in the site
www.provider.com/~user/cults/
need to be separated from those on other cults, so they
still need to be indexed manually.
Thus, in order to make life easier for everybody, including
yourself, please group all your scientology-related pages
in one dedicated directory and put an index in it. Whether
you choose to group them further within the base
directory or not does not affect the spidering. For example
www.provider.com/~user/cos/
www.provider.com/~user/cos/skriptures/
www.provider.com/~user/cos/pickets/
www.provider.com/~user/cos/news/
is just as easy to spider as
www.provider.com/~user/cos/
as long as the /cos/ directory is free of
non-scientology-related material. Furthermore, a site
like this can be automatically re-indexed every once
in a while; new pages will be found and obsolete pages
will be removed from the index in a clean and useful
manner.
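
As a rough picture of what a re-index pass has to work out, the sketch below compares the pages already in the index with the pages found on a fresh crawl. The function name reindex_changes and the page names are assumptions for illustration; the real spider does its own bookkeeping.

```python
def reindex_changes(indexed, crawled):
    """Return (pages to add, pages to drop) as two sorted lists."""
    indexed, crawled = set(indexed), set(crawled)
    to_add = sorted(crawled - indexed)    # found now, not yet in the index
    to_drop = sorted(indexed - crawled)   # in the index, no longer on the site
    return to_add, to_drop


# Hypothetical state of the /cos/ directory before and after an update.
old = ["cos/index.html", "cos/news/1.html", "cos/pickets/a.html"]
new = ["cos/index.html", "cos/news/1.html", "cos/news/2.html"]

added, dropped = reindex_changes(old, new)
print("add:", added)
print("drop:", dropped)
```

This only works automatically when the crawl can be trusted to return relevant pages and nothing else, which is what the dedicated directory buys you.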
So, if you are about to submit your site for indexing,
please take the opportunity to check it first and adjust
it where necessary. If your site looks like the bad
example above and you don't want to do anything about it,
then please submit all the individual pages that you
want indexed and repeat the procedure every time you
add, delete or change a page. Such pages will always
be at the bottom of the todo list of the spider.
So what's that about mirrors?
As mentioned earlier, while the spider downloads a page
in order to index it, it also keeps a local copy. This
can come in very handy if a site is suddenly taken down
for the usual reasons, or if the maintainer of a site
happens to accidentally delete it and, as always in
these cases, Murphy has taken the backup.
There is one important limitation though: the spider
only downloads pages that will be indexed, i.e. text.
It keeps copies of all .htm*, .txt, .php*, etc. files,
but it does not download images, sounds, or any
other non-indexable files. This might change in the
future, but currently all that is saved is text.
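
A filter of this kind could look roughly like the sketch below. The pattern list contains only the three file types named above (the real list is longer), and treating a URL that ends in a slash as an index page is my assumption for the example, as is the http:// scheme.

```python
import fnmatch
import posixpath
from urllib.parse import urlsplit

# Only the patterns named in the text; the actual list is longer ("etc").
TEXT_PATTERNS = ("*.htm*", "*.txt", "*.php*")


def is_indexable(url: str) -> bool:
    """Return True if the URL looks like a text page worth downloading."""
    name = posixpath.basename(urlsplit(url).path).lower()
    if not name:
        # Assumption: a directory URL is served as an index page, i.e. text.
        return True
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in TEXT_PATTERNS)


for url in (
    "http://www.provider.com/~user/cos/index.html",
    "http://www.provider.com/~user/cos/page.php3",
    "http://www.provider.com/~user/cos/logo.gif",
    "http://www.provider.com/~user/cos/",
):
    print(url, is_indexable(url))
```

A mirror built this way restores the words of a site but not its pictures, so keep your own backup of images and sounds.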
For various reasons the mirrors are not publicly
available. If you want a copy of a site, you need
to mail me.
Swell, but who is mirroring the search engine itself?
Nobody, so it could go down at any time. A search engine
has many enemies even if you don't count the CoS: anything
from hardware failures to unpaid bills. If you would
like to mirror it, you are most welcome. It will run
on practically any UNIX variant and all the software
you need is free. To create a mirror, install
MySQL,
then install
mnogosearch,
and then mail me
for a copy of the database and the HTML. If you have
some experience you can have it up and running in less
than half an hour. After all, you don't need to
configure anything; you will get a ready configuration
with the mirror.
Zenon Panoussis