Message board search engine
12-23-2008, 07:09 PM
Posts: 922
Name: Geoff Vader
Location: In my dreams
Is there anything intrinsically wrong with setting up a directory of message boards, along with a daily/weekly list of the top posts on all the boards, and making the data at least slightly searchable?
I have a list of 1000 different consumer boards (and a further 200 political ones, which I may not use any time soon, since politics seems best kept out of technology and selling and living and eating and romance, and pretty much everything, really) and I can easily set up software to crawl them all and compile a site, overnight.
I may actually just do it in an hour or less. Anyone got any ideas? Would it be a good thing? Imagine you want to ask someone a question on some important topic: you go to my directory/search and you find, by searching or by browsing, where people are talking about exactly what you want to know about. It should be far more accurate than a mainstream solution like Google, because there the forums are diluted by tonnes of places with information published in other ways, so they're spread out and ranked in order of the most mainstream, not in order of the most relevant deep-specific match...
I may find it useful myself! The politics one wouldn't work at all, since there are about 3 things people search for relating to politics and that's ALL they talk about all the time and there'd be nothing to match, there'd be no utility - look for war, economy, bunch of crap, or anything else that is one of the 3 topics of political discussion, search across 200 politics sites and you'll get 200 matches for each.
But on consumer forums ranging from sporting activities to specific musical or other FMCG markets... the number of things you could find out quickly and efficiently could be quite broad - ALL 1000 forums I have are branded by Google as "UK" sites, although that may not be true in all ways. But they were all found with the UK results radio button checked.
The majority of them are also visibly UK, so it's still a heavily UK network of communities - the total number of users they have between them is probably around a few million, since they are mostly forums with at least 1000 members, and many with considerably larger member bases.
Of course I could also crawl their archives and make a full-scale search facility for searching all of them backwards across time, which would also be useful, but that would take weeks, and more servers, to set up properly.
I would eventually be able to have at least 10,000 forums on it in the UK, a large sample of blogs (I don't have any size estimate yet), and of course about 50,000+ forums in the USA and a very large number of USA-based blogs.
I might not make the initial one tonight after all. If it's going to extend that far I may as well do it slowly and carefully. So if anyone has any ideas, do add them... good ones will get included when I get the thing done!
Then again since perl programmers allegedly do strange things in the night with soldering irons and mythical dragons (or whatever it was in the religious programming jedi thread) I COULD write a bit of a mad set of shell and perl scripts tonight, take the 1000 urls and turn it into a directory and search engine overnight, feeding it instantly into google and letting its multitudes of relevant useful well-linked pages get massively absorbed into googletopia and spread like a highly-spreadable thing... butter at the right temperature, on perfect-coloured toast (perfection is subjective).
I may just have some toast now. It may be 11.06pm but the power of suggestion often supersedes the rule of logic.
01-13-2009, 08:53 AM
Re: Message board search engine
Since I've been reprieved of my sentence of spamming, I have so much time on my hands I feel strongly inclined to build the search engine above. For now all I'm going to do is get a crawler to visit all the forums and try to just make a list of the post titles on their front pages, ignoring all else, and see where that leads.
Experimentation is always much more fun than work which is proven to succeed and requires nothing more than bullish subservience to daily labour schedules. Thank god I'm free from the "bagne" - the prison camp, the labour camp... modern babylon.
I studied the Holocaust in great depth when I was at Oxford, indeed I cried at the horrors my Jewish brethren faced at the hands of the Nazis, and indeed anyone else the Germans managed to lock up in their camps... but I cry more when I see that the words of Primo Levi (a survivor) are so woefully contradicted - he said something like "don't make this happen in your own homes, whatever you do" - and all I see is middle class kids I knew, slaving for a dollar, the leash around their neck still tight even though the metal it's made from has gone from gold, to silver, to copper.
People here should read Se questo è un uomo, by Primo Levi, if they want to know why it's not a good idea to join the rat race in our times. I wrote a poem about it...
I took the words at the beginning of Se questo è un uomo, when I was about 20 years old, and I wrote this thing instead, based on the lives of my Asian and black friends in the ghettoes of London
Quote:
TO MOISHA GOLDSTEIN
You who live safe,
In the warmth of Western Capitalist Democracy,
You who find, returning in the evening,
Fast food and obsequious serving people:
Consider if this is a man
Who works behind the counter
Who does not know peace
Who fights for a minimum wage he's already spent
Who becomes unemployed for a bag of fries.
Consider if this is a woman,
her name pinned to her chest,
Her head filled with the names of different meal deals
Her body permeated by lardy vapours
Like a greasy breakfast.
MEDITATE that this goes on.
I commend these words to you.
Carve them in your mind,
At home, in the street,
Going to bed, rising;
Repeat them to your children
So that they know what you've done.
|
Anyway. I hope to be back with a relatively well-working search engine later today... fingers crossed. First, though, I have to go out and go for a long walk and stave off the growing boredom. I can't even spend money - every last penny of mine is now in the money machine and I'm going to keep it that way until long after the bailiffs give up trying to break into my house to take away the mouse traps (there's nothing else of value... it's illegal for them to touch my computer - it's for my job). I've had bailiffs trying to trick me into letting them in at 7pm, 8.30am... who knows what's next. These people really don't have a clue who they're dealing with.
01-13-2009, 09:16 AM
Re: Message board search engine
Right now my server is already busy crawling 1000 forums - I thought why wait. What's later on got that now doesn't? (Don't you wish more women felt that way, if you're a guy?)
Anyway. Here's the code. First I wrote this perl script...
Code:
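#!/usr/bin/perl
# Reads one forum URL per line from forums.txt and writes forumer2.sh,
# a shell script containing one wget command per forum.
open (file,"<forums.txt") or die "cannot read forums.txt: $!";
open (out,">forumer2.sh") or die "cannot write forumer2.sh: $!";
print out "#!/usr/bin/sh\n\n";
while ($line=<file>){
chomp $line;
$count++;
# the page for line N of forums.txt is saved as forumdata/N
print out "wget \"$line\" -O \"forumdata/$count\"\n";
}
close (file);
close (out);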
I then made a folder called forumdata
then I ran the above script by saying perl forumer1.pl (on the command line)
This produced my primary shell script...
Code:
#!/usr/bin/sh
wget "http://www.frugallerforum.co.uk/index.php" -O "forumdata/1"
wget "http://www2.mensfitnessmagazine.co.uk/forum/" -O "forumdata/2"
wget "http://www.handbag.com/vbulletin/" -O "forumdata/3"
wget "http://forum.myprotein.co.uk/" -O "forumdata/4"
wget "http://www.cyclechat.co.uk/forums/index.php" -O "forumdata/5"
wget "http://www.mfat.co.uk/forums/" -O "forumdata/6"
wget "http://vbulletin.thesite.org/index.php" -O "forumdata/7"
wget "http://www.musclesoc.com/index.php" -O "forumdata/8"
wget "http://www.shell-livewire.com/forums" -O "forumdata/9"
wget "http://www.silkysteps.com/forum/" -O "forumdata/10"
wget "http://www.p8ntballer-forums.com/vb/index.php" -O "forumdata/11"
wget "http://forums.overclockers.co.uk/index.php" -O "forumdata/12"
wget "http://www.netmums.com/coffeehouse/index.php" -O "forumdata/13"
wget "http://www.wiids.co.uk/boards/index.php" -O "forumdata/14"
wget "http://www.ps3forums.com/index.php" -O "forumdata/15"
wget "http://www.falconryforum.co.uk/index.php" -O "forumdata/16"
wget "http://www.healthypages.co.uk/forum/index.php" -O "forumdata/17"
wget "http://www.fencingforum.com/forum/index.php" -O "forumdata/18"
wget "http://www.northstandchat.biz/index.php" -O "forumdata/19"
wget "https://www.fightstuff.co.uk/forums/index.php" -O "forumdata/20"
wget "http://www.whatsontv.co.uk/forums/index.php" -O "forumdata/21"
wget "http://board.dogbomb.co.uk/index.php" -O "forumdata/22"
(etc, etc...)
which is currently running and gathering the basic page data
Quote:
sh forumer2.sh
--14:04:29-- http://www.frugallerforum.co.uk/index.php
=> `forumdata/1'
Resolving www.frugallerforum.co.uk... 83.170.72.84
Connecting to www.frugallerforum.co.uk[83.170.72.84]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 87,771 --.--K/s
14:04:39 (2.54 MB/s) - `forumdata/1' saved [87,771]
--14:04:39-- http://www2.mensfitnessmagazine.co.uk/forum/
=> `forumdata/2'
Resolving www2.mensfitnessmagazine.co.uk... 194.70.234.208
Connecting to www2.mensfitnessmagazine.co.uk[194.70.234.208]:80... connected.
HTTP request sent, awaiting response... 200 OK
Cookie coming from www2.mensfitnessmagazine.co.uk attempted to set domain to 2.mensfitnessmagazine.co.uk
Cookie coming from www2.mensfitnessmagazine.co.uk attempted to set domain to 2.mensfitnessmagazine.co.uk
Length: unspecified [text/html]
[ <=> ] 22,640 --.--K/s
14:04:49 (1.31 MB/s) - `forumdata/2' saved [22,640]
--14:04:49-- http://www.handbag.com/vbulletin/
=> `forumdata/3'
Resolving www.handbag.com... 92.122.208.248, 92.122.208.200
Connecting to www.handbag.com[92.122.208.248]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 36,609 --.--K/s
14:04:53 (2.16 MB/s) - `forumdata/3' saved [36,609]
--14:04:53-- http://forum.myprotein.co.uk/
=> `forumdata/4'
Resolving forum.myprotein.co.uk... 78.136.60.42
Connecting to forum.myprotein.co.uk[78.136.60.42]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 66,887 --.--K/s
14:04:59 (2.21 MB/s) - `forumdata/4' saved [66,887]
--14:04:59-- http://www.cyclechat.co.uk/forums/index.php
=> `forumdata/5'
Resolving www.cyclechat.co.uk... 80.87.131.154
Connecting to www.cyclechat.co.uk[80.87.131.154]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 103,817 --.--K/s
14:04:59 (1.82 MB/s) - `forumdata/5' saved [103,817]
--14:04:59-- http://www.mfat.co.uk/forums/
=> `forumdata/6'
Resolving www.mfat.co.uk... 94.229.64.194
Connecting to www.mfat.co.uk[94.229.64.194]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
14:05:11 ERROR 404: Not Found.
--14:05:11-- http://vbulletin.thesite.org/index.php
=> `forumdata/7'
Resolving vbulletin.thesite.org... 195.194.49.9
Connecting to vbulletin.thesite.org[195.194.49.9]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 74,190 --.--K/s
14:05:11 (2.14 MB/s) - `forumdata/7' saved [74,190]
--14:05:11-- http://www.musclesoc.com/index.php
=> `forumdata/8'
Resolving www.musclesoc.com... 85.92.84.8
Connecting to www.musclesoc.com[85.92.84.8]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 84,037 --.--K/s
14:05:15 (2.01 MB/s) - `forumdata/8' saved [84,037]
--14:05:15-- http://www.shell-livewire.com/forums
=> `forumdata/9'
Resolving www.shell-livewire.com... 92.52.113.92
Connecting to www.shell-livewire.com[92.52.113.92]:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.shell-livewire.com/forums/ [following]
--14:05:20-- http://www.shell-livewire.com/forums/
=> `forumdata/9'
Connecting to www.shell-livewire.com[92.52.113.92]:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /forums/forumdisplay.php?forumid=19 [following]
--14:05:20-- http://www.shell-livewire.com/forums...php?forumid=19
=> `forumdata/9'
Connecting to www.shell-livewire.com[92.52.113.92]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58,473 [text/html]
100%[=========================================================================>] 58,473 --.--K/s
14:05:20 (1.05 MB/s) - `forumdata/9' saved [58,473/58,473]
--14:05:20-- http://www.silkysteps.com/forum/
=> `forumdata/10'
Resolving www.silkysteps.com... 78.129.162.97
Connecting to www.silkysteps.com[78.129.162.97]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 104,240 --.--K/s
14:05:25 (1.12 MB/s) - `forumdata/10' saved [104,240]
--14:05:25-- http://www.p8ntballer-forums.com/vb/index.php
=> `forumdata/11'
Resolving www.p8ntballer-forums.com... 87.106.250.129
Connecting to www.p8ntballer-forums.com[87.106.250.129]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 78,951 --.--K/s
14:05:25 (726.29 KB/s) - `forumdata/11' saved [78,951]
|
etc
(there's 1000 forums in there, so I may as well go for that walk)
Then I shall write and run a perl script which, for now, just lets a user's search check inside all of that data. But the best approach is to reprocess that data, create a keyword list / subject list, and use the output as the place to scan for matches from a user's query
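The keyword-list idea above could be sketched in shell roughly as follows. This is only an illustration, not the code that was actually run: build_index, lookup and the index file are made-up names, the tag stripping only copes with tags that open and close on one line, and words of one or two letters are dropped as noise. The index file is assumed to live outside the forumdata folder so it doesn't get indexed itself.

```shell
#!/bin/sh
# Sketch only: build_index and lookup are hypothetical helper names.

# build_index DATADIR INDEXFILE
# Writes "word<TAB>forum-number" pairs, one per distinct word per saved page.
build_index() {
    datadir=$1
    index=$2
    : > "$index"
    for page in "$datadir"/*; do
        [ -f "$page" ] || continue
        n=${page##*/}    # the file name is the forum number
        # drop tags, lowercase, put one word per line, de-duplicate
        sed 's/<[^>]*>/ /g' "$page" \
            | tr 'A-Z' 'a-z' \
            | tr -cs 'a-z0-9' '\n' \
            | sort -u \
            | awk -v n="$n" 'length($0) > 2 { print $0 "\t" n }' >> "$index"
    done
}

# lookup WORD INDEXFILE
# Prints the numbers of the forums whose page contains WORD.
lookup() {
    word=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    awk -F '\t' -v w="$word" '$1 == w { print $2 }' "$2" | sort -n
}
```

Usage would be something like build_index forumdata index.tsv once after each crawl, then lookup beer index.tsv per query, so a search scans one small file instead of reopening 1000 pages.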
I'll come back and give you a search box to try out after lunch...
how's that for rapid development? Perl and Shell are like Starsky and Hutch, or Tate and Lyle, or Twix and Tea, or Good music and A music player. When you get the two together, an abundance of wonder erupts.
This has been an open source guide to web crawlers, search engines and other "secrets of the powerful".
01-13-2009, 09:32 AM
Re: Message board search engine
Okay, I then cheated - item no. 138 was a url that wouldn't respond and just hung the **** script in the air, so after a few mins I quit the shell, deleted all the commands up to that one, and then relaunched. Naturally the correct procedure is to give each request some kind of timeout, so that the script gives up on a dead site and moves on. Not so hard to write, but one is lazy.
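For what it's worth, the timeout-and-move-on behaviour doesn't need much writing, because wget has options for it. Below is a minimal sketch, assuming GNU wget. The function name fetch_forums, the FETCH_CMD variable and the 15 second limit are all illustrative choices, not anything from the scripts above.

```shell
#!/bin/sh
# Sketch only: fetch_forums and FETCH_CMD are hypothetical names.
# FETCH_CMD defaults to wget and can be overridden for a dry run.
FETCH_CMD=${FETCH_CMD:-wget}

# fetch_forums LIST OUTDIR
# Reads one URL per line from LIST and saves line N as OUTDIR/N.
fetch_forums() {
    list=$1
    outdir=$2
    mkdir -p "$outdir"
    count=0
    while IFS= read -r url; do
        count=$((count + 1))
        # --timeout caps the DNS, connect and read waits; --tries=1 turns off
        # retries, so a dead host costs seconds instead of hanging the run.
        $FETCH_CMD --timeout=15 --tries=1 -O "$outdir/$count" "$url" \
            || echo "skipped $count: $url" >&2
    done < "$list"
}
```

Called as fetch_forums forums.txt forumdata, this keeps the same numbering as the generated script: the counter goes up even when a fetch fails, so forumdata/N still lines up with line N of forums.txt.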
01-13-2009, 10:04 AM
Re: Message board search engine
Right. give it a go if you fancy
http://www.ukusers4u.com/cgi-bin/forumsearch.pl
NB it's got a lot of cleaning up to do - eg the quoted blurb is just the start of the matching line, not the part where the match actually is, so a lot of results show no visible match in the blurb, even though there is one further along
also, i haven't yet got it to remove the *£*£ out of any images or other tags that travel from their page to mine
but all in all it works
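Both of those clean-up jobs (blurbs that miss the match, and tags leaking through) can be sketched in a few lines of shell. This is a rough illustration under stated assumptions, not the live script: strip_tags, snippet and blurb are made-up names, tags are only removed when they open and close on the same line, and the match is a plain case-insensitive substring rather than a whole word.

```shell
#!/bin/sh
# Sketch only: strip_tags, snippet and blurb are hypothetical helper names.

# strip_tags: removes single-line tags from stdin and squeezes whitespace.
strip_tags() {
    sed -e 's/<[^>]*>//g' -e 's/[[:space:]][[:space:]]*/ /g'
}

# snippet QUERY [WIDTH]
# Prints up to WIDTH characters (default 60) either side of the first
# case-insensitive occurrence of QUERY on stdin, then stops.
snippet() {
    awk -v q="$1" -v w="${2:-60}" '
        BEGIN { q = tolower(q) }
        {
            pos = index(tolower($0), q)
            if (pos == 0) next
            start = pos - w
            if (start < 1) start = 1
            print substr($0, start, (pos - start) + length(q) + w)
            exit
        }'
}

# blurb QUERY [WIDTH]: tag-free excerpt centred on the first match.
blurb() {
    strip_tags | snippet "$@"
}
```

Something like blurb beer 75 < forumdata/12 would then give a 150-odd character excerpt with the match in the middle. Stripping the tags before matching also means a query that only appears inside an attribute, such as a file name in a link, no longer counts as a hit.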
search for beer, or seo, or whatever - see what the top 1000 forums in the UK are talking about right now. also the matching technique is limited: it only finds the query as a whole word or phrase within a single line of the page
this is just the bare bones
here's the perl script of the search script...
Code:
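#!/usr/bin/perl
# CGI search over the crawled front pages saved as forumdata/1 .. forumdata/1000
use universal_variables; # local package that parses the web input into %in
%script_input=%universal_variables::in;
print "Content-type: text/html\n\n";
print "<form method=post action=\"/cgi-bin/forumsearch.pl\">query: <input type=text name=query>
<input type=submit value=search></form>";
# nothing to do unless the query contains at least one word character
if ($script_input{query}!~m/\w/s){
exit;
}
$script_input{query}="\L$script_input{query}";
# escape the query before echoing it back into the page
($shown=$script_input{query})=~s/&/&amp;/g;
$shown=~s/</&lt;/g;
$shown=~s/>/&gt;/g;
print "$shown<hr>";
foreach $n(1..1000){
# forums that haven't been crawled yet have no file, so skip them
open (file,"<forumdata/$n") or next;
while ($line=<file>){
$line="\L$line";
# \Q..\E stops regex metacharacters in the query being interpreted
if ($line=~m/\b\Q$script_input{query}\E\b/s){
# keep the first matching line of each forum, keyed by forum number
$thatcher{$n}=$line;
last;
}
}
close (file);
}
open (ood,"<forums.txt") or die "cannot read forums.txt: $!";
@forums=<ood>;
close (ood);
chomp @forums;
foreach $out(sort {$a<=>$b} keys %thatcher){
$blurb=$thatcher{$out};
# crude: break any links carried over from the crawled page
$blurb=~s(<a href=")(</a>)g;
# the blurb is the first 150 characters of the matching line
$blurb=substr($blurb,0,150);
$blurb=~s(\b\Q$script_input{query}\E\b)(<b>$shown</b>)g;
# forumdata/N came from line N of forums.txt, which is $forums[N-1]
print "<a href=\"$forums[$out-1]\">$forums[$out-1]</a> - $blurb<hr>";
}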
the crawler's done 486 forums so far, but within another 20 mins to half hour I suspect the entire first 1000 will be done.
so there you have it - a quick guide to building a search.
the next thing i do to it will be to get it to check for matches in the link text of the href tags only - so that it's searching the forum post titles rather than everything on the page. i can also then print out a summary of every site, and put together some daily updating indexes of subjects of posts on forums.
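Pulling the link text out of the saved pages could look something like the sketch below. It assumes a grep that supports -o (GNU and BSD both do), the function name link_titles is made up, and it only catches links whose text has no nested tags and sits on one line, so bold or coloured thread titles would slip through.

```shell
#!/bin/sh
# Sketch only: link_titles is a hypothetical helper name.

# link_titles: reads HTML on stdin and prints the text of each
# <a ... href=...>text</a> link, one per line, skipping empty ones.
link_titles() {
    grep -o -i '<a [^>]*href[^>]*>[^<]*</a>' \
        | sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v '^$'
}
```

Running link_titles < forumdata/5 would give a plain list of link texts for that forum's front page. On most forum software that list is mostly thread and section titles, with some navigation links mixed in that would still need filtering out.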
the real fun is only just about to begin. but I'm signing off from work mode now. i'm supposed to be retired. what good is being retired if you go and do something silly like work?
Last edited by witnesstheday; 01-13-2009 at 10:13 AM..
01-13-2009, 10:08 AM
Re: Message board search engine
1 hour to build a basic search for 1000 forums. Not bad.
(nb universal variables is just a package through which i process all web input - it's pretty short and sweet, don't worry about it)
note how i managed to include a plug for the "ood" - that famous species of slave from the neo-doctor-who era.
Last edited by witnesstheday; 01-13-2009 at 10:11 AM..