Recently, you've been seeing various notes from me regarding issues with how search is performed on our website. A while back, we switched from Google to Bing because we thought it did a better job of indexing the REBOL 3 documentation. Now, after a short period, I'm dissatisfied with Bing as well.
I don't know how you find your REBOL related documents, but I use the search field in the upper right side of our web pages. Unfortunately, and perhaps it's just because I know what pages actually exist, search results often don't show the page I know to be the best result.
The REBOL.com website contains 10,794 pages (and that does not include the pages on REBOL.net, REBOL.org, or various other REBOL websites.) Perhaps that's simply beyond the capabilities of the public search engine systems? Meaning that they internally limit how many pages they index for a site. Probably makes sense, from their perspective.
I can give you an example of this problem. Using Google's webmaster tools, I see that Google has indexed only 217 of the 851 pages in the REBOL 3 documentation. Yes, it has an XML-based sitemap which you can see is accurate and up-to-date. Google reads it successfully and flags no errors.
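For reference, generating a sitemap like that takes only a few lines of code. Here is a minimal sketch in Python (the URL shown is illustrative, not an actual docs path):

```python
from xml.sax.saxutils import escape

def sitemap(urls):
    """Emit a minimal sitemap in the sitemaps.org 0.9 format."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>")

xml = sitemap(["http://www.rebol.com/r3/docs/index.html"])
```

Optional per-URL fields like `<lastmod>` and `<priority>` can be added inside each `<url>` element the same way.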
I suppose it might be possible to divide the site into separate sub-domains... assuming search engines do better that way (which may not be true, I don't know.) There could be a docs.rebol.com, blogs.rebol.com, and downloads.rebol.com, etc. Again, not sure if that would help or not.
Perhaps an even better solution would be to use the RIX search engine which did a good job indexing REBOL in the past, but I'm not sure if it's supported these days (and it's gotten a bit on the slow side.) Does anyone have information about RIX or know how to contact its author?
Well, that's the situation. Let me know if you have some ideas or insights.
If you use a Linux box, take a look at the www.dtsearch.com engine.
Or try one of the open-source full-text systems. Even SQLite has an FTS extension.
Overall, I wouldn't spend a lot of time trying to fix Google and the rest. There's no fixing them; better to DIY.
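To illustrate the SQLite option: a minimal full-text search sketch in Python, assuming your SQLite build includes the FTS5 module (the sample documents are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
con.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("R3 Extensions", "how to write REBOL 3 extensions in C"),
    ("Draw Dialect",  "the draw dialect renders 2D graphics"),
])
# Full-text match; FTS5 handles tokenizing for us.
rows = con.execute("SELECT title FROM docs WHERE docs MATCH 'extensions'").fetchall()
# rows == [('R3 Extensions',)]
```

For a site index you would insert one row per page and keep the database file alongside the web server.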
Stacey Abshire 7-May-2010 15:33:41
Write your own... It's not that difficult.
Graham 7-May-2010 21:49:58
probably not your thing
tomc 8-May-2010 0:29:17
gee, wish I had the problem of not enough bots ...
careful of what you ask for ;)
that said, public search engines do not (and probably cannot) generically return pages in the best context of each specific domain.
So for many years we just had our own database searches. As the volume of free text to scan increased, we added Lucene and put work into making it ontologically aware,
so that searches know that the knee bone is connected to the... leg bone, and so on.
Lucene has worked pretty well, but we have a well defined idea of what we consider important.
My Google experience: after an outcry we relented and allowed Google to spider parts of our site. A year later, the verdict is in. Bots outnumber humans by orders of magnitude; real users coming in from Google are less than 5%, and of that 5%, half just typed our domain name into the Google search box instead of the URL bar... and in return we get to deal with the problems of too many bots.
Luis. 10-May-2010 1:49:05
Why not keep the general website (i.e., the public-facing pages) managed through DocBase, but tagged in a way a Rebol-based search engine can use? Have those pages deny robots.
Then use the Rebol search engine (or "Rebol filing engine") to generate a batch of public-search-engine-friendly pages, which _do_ allow robots.
Essentially: generating, via Rebol, search-engine-friendly pages, with the content of the search being under the control of the Rebol engine generating the pages from DocBase.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40349 has a link to a Starter Guide pdf if it helps.
John 10-May-2010 16:06:43
No offense, but it seems like you'd have more important items to deal with (version 3) than how someone else's engine deals with your docs.
Maybe I'm misunderstanding the importance of what you're stating, or the issue itself (I've done that numerous times before).
Just get a google mini ($2000) and use it internally for searches against your docs.
Or use this as a challenge for some of the gurus to come together and create one.
Again, no offense meant.
Janko M. 11-May-2010 1:33:23
There is Apache Solr, an excellent search engine that is quite simple to integrate with over HTTP (it uses the Lucene indexer). I implemented it for 3 clients and know of many huge projects that rely on it.
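Integrating with Solr over HTTP is largely URL construction. A sketch of building a select query in Python (the core name and the `site` field are assumptions, not a real deployment):

```python
from urllib.parse import urlencode

def solr_select_url(base, query, site=None, rows=10):
    """Build a Solr /select URL; 'site' restricts results via a filter query."""
    params = {"q": query, "wt": "json", "rows": rows}
    if site:
        params["fq"] = f"site:{site}"  # assumes the schema defines a 'site' field
    return f"{base}/select?{urlencode(params)}"

url = solr_select_url("http://localhost:8983/solr/rebol", "draw dialect",
                      site="rebol.com")
```

Fetching that URL returns JSON results, so the website's search field only needs a thin wrapper around an HTTP GET.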
Rebol.org also has a very good and fast search engine made in REBOL. I talked to (I think) Sunanda about it one time, and I was highly impressed.
Sunanda 11-May-2010 5:56:44
I commented about the REBOL.org indexer/search engine in the Bing Blog....My response may have been overlooked:)
I think it has potential to implement a search on REBOL.net and/or .com.
Hallvard 11-May-2010 9:46:09
RIX was written almost a decade ago, but true, it's still there... on an old PPC Mac. It's often slow, depending on the other tasks running on an often-overloaded test server.
I haven't been doing maintenance on RIX. I'd be happy to share, or at least show, the code, but just now (maybe simply because the interest suddenly appeared?) I don't feel like having it run on another server... Or maybe that's OK. I'll think about it, if the question arises.
Anyway, there's MySQL on the other end, DocKimbel's MySQL driver in between, and REBOL 2.7.7 on the server.
I'll send you the code to look at, Carl, if you'll tell me what address to use. You'll reach me on rix (at) babelserverdotorg.
Graham 11-May-2010 12:26:40
I setup the search engine I referenced above on to an IBM cloud server.
Currently about 25k pages are indexed, but many of these are virtual pages from the wiki on rebol.net... so it needs some optimizing of what to index.
Carl Sassenrath 12-May-2010 12:12:21
First, thanks for the suggestions. My desire is not to spend time solving this problem myself, but to find a solution such that I could simply point the REBOL.com search field to another dependable URL. (Meaning, it's a service that's not going to vanish in a few months, as so many do.)
My motivation here is that for R3 development those of us involved in the project search the docs more and more for specific phrases and words. When searches come up empty for patterns we know are there, our confidence in the bing/google search method is diminished.
Now I'll check out your links above and write a few comments. Thanks for the replies.
Carl Sassenrath 12-May-2010 13:05:47
The above suggestions fall into these responses:
Operating third-party software on our server: Frankly, I don't have the time; it's just another service to support, libs to resolve, configs to manage, DBs to set up, etc.
Writing it ourselves: Well, we actually have one, but it's hidden from most users because it's brute force. Back 10 years ago, Bo Lechnowsky also wrote one for REBOL tech with better performance (hashed to bitmaps for fast search with low overhead). But, not sure where that code is, and don't have time to mess with it.
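Bo's code may be lost, but "hashed to bitmaps for fast search with low overhead" sounds like a Bloom-filter-style word signature per page. A rough guess at that technique in Python (the names and sizes are mine, not his):

```python
import hashlib

BITS = 1024  # signature width per page; a real index would tune this

def signature(text):
    """Fold each word of a page into one fixed-size bitmap via hashing."""
    sig = 0
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        sig |= 1 << (h % BITS)
    return sig

def may_contain(sig, word):
    """False means definitely absent; True means probably present."""
    h = int(hashlib.md5(word.lower().encode()).hexdigest(), 16)
    return bool(sig & (1 << (h % BITS)))

# One small signature per page lets a search skip most pages instantly.
pages = {"extensions.html": "writing rebol 3 extensions in c"}
sigs = {url: signature(text) for url, text in pages.items()}
hits = [url for url, s in sigs.items() if may_contain(s, "extensions")]
```

Pages that pass the bitmap test would then be scanned directly to confirm the match, which keeps both the index and the false-positive cost small.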
Linking to third party software on another server: That's good, as long as it does what we need and is reliable. So, evaluating above suggestions, in reverse order:
IBM Omnifind (on Graham's server): I searched for REBOL Extensions, and it didn't find the main R3 docs (but, it looks like it's not indexing rebol.com at all, so it wasn't a fair test).
RIX (Hallvard's system): Been around a long time, finds the correct R3 docs, but very slow and does not eliminate duplicate links. Also, not sure about the importance of docs that follow the top hits. (This is an old system, see Hallvard's comments above.)
REBOL.org Search (Sunanda's system): It was created for nearly the same reason: indexing "many pages is not the same as all pages". This method is a potential solution, depending on what it requires to make it work.
And, actually, the more I think about it... it seems likely to me that we're missing something. There's definitely a market opportunity being missed by google, yahoo, and bing... and those are almost always filled in by entrepreneurial ventures. "Real custom search." I bet there's one out there, and we've just not hit it yet.
Graham 12-May-2010 15:36:22
Carl, try again ... I've got it indexing rebol.com now, and it does find rebol extensions.
37.5k pages indexed so far.
Kev 12-May-2010 16:17:24
Readers probably search for a finite number of issues.
If so, the search engine is often repeating the same searches over and over again.
This could be verified by recording the search strings submitted to the search engine on the Rebol.com site.
If, for example, an online search engine were used to search for "Draw" on the Rebol.com site, the results could be retained and analyzed, URLs added and removed, and ranked. Rebol 3-related documents could be separated out. A "Draw" URL reference page would be created. This human editing would result in a focused and relevant set of pages. An index on a web page would point the way to the URL set for each issue.
It would take some effort, but it may not have to be done that often. New pages would routinely be added to the subject URL index page by their authors.
Volunteers from the Rebol community might have a shot at this. Documents not discovered by volunteers would have to be added by Rebol.com.
Maybe try just one commonly searched issue to see how it goes.
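The first step of this proposal, recording and tallying the submitted search strings, is trivial to sketch in Python (the query log here is made up):

```python
from collections import Counter

# Hypothetical log of search strings submitted through the site's search box.
query_log = ["draw", "parse", "Draw", "view", "DRAW", "parse"]

# Case-fold and count to find which issues readers actually search for.
top = Counter(q.lower() for q in query_log).most_common(2)
# top == [('draw', 3), ('parse', 2)]
```

The most frequent queries would then tell the volunteers which hand-curated reference pages to build first.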
Luis. 13-May-2010 1:34:29
That's similar to what I was proposing, but automated (aside from the tags initially created manually).
Graham 15-May-2010 13:38:49
Omnifind is now up to 90,000 pages, and indexes rebol.com, rebol.org and rebol.net
You can restrict your search to a particular site either using the advanced search option or use site:
Eg. site:rebol.com construct
croquemitaine 21-May-2010 10:55:06
Why not use meta-engines like MetaCrawler or Ixquick?
Ixquick doesn't keep a trace of the user's ID or IP. Think about it.
Arnold 22-May-2010 5:20:37
rebol.com/docs directory is 403 forbidden.
If it was not, would search engines crawl through it better?
Maybe open up /docs for a bit. I tried to read it to see what might be there that I was not seeing in the search engines, and got nothing.
Carl Sassenrath 25-May-2010 17:33:45
Graham: it's looking better... I'll direct the site search there, thanks.
Arnold: it's ok for docs/ to be 403, because nothing points to that url (at least, nothing should point to it.)
Nada 25-May-2010 19:43:54
That new web search is not very good (the one from IBM/Yahoo now being tested on the main site). Search for "tutorial" and it does not find the main tutorial page on rebol.com; search for "tutorials" and you get different results. It does not understand pluralization. Then look at the quality of the hits at the top to find that they are useless results, such as pages from the email list that have nothing to do with tutorials. Go back to the old method, please.
Graham 26-May-2010 0:00:53
The search should be modified a little. Since this search is for rebol.com only, I would suggest fixing it so that instead of searching on just "tutorial", it also adds the site: operator by default.
eg. site:rebol.com tutorial
And you'll get the right result.
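That default could be added server-side before the query is forwarded to the engine. A tiny sketch in Python (the function name is mine):

```python
def scoped_query(q, site="rebol.com"):
    """Prepend a site: restriction unless the user already supplied one."""
    q = q.strip()
    if "site:" in q:
        return q  # respect an explicit site: given by the user
    return f"site:{site} {q}"

# scoped_query("tutorial") -> "site:rebol.com tutorial"
```

This keeps the search box simple for users while still allowing power users to override the site restriction themselves.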
Carl Sassenrath 26-May-2010 7:40:25
Restored original Google-based search. Above comments point out issues with other engines. Also, google search has better site integration (e.g. template, head/tail nav-bars, CSS).
Graham, thanks for trying out the IBM/Yahoo search engine. Unfortunately, as noted above, it looks like that engine needs work. The note above about the quality of the top hits is particularly relevant.
Brett 10-Jun-2010 9:20:47
Carl, I've had hit-and-miss results with Google's indexing of customers' sites over the years, and believe it or not, the best thing you can do is WRITE TO THEM and tell them about it. They're a lot more receptive to this sort of feedback than people might imagine. If they can help you achieve 100% indexing, they'll do it, and they'll do it without trying to bill you for their time.
Dick Dunbar 4-Mar-2011 23:09:11
I have good luck with www.dogpile.com
It seems to learn what people are really searching for.