Old 08-02-2004, 12:29 AM Need help again...
Novice Talker

Posts: 6
Trades: 0
Well, I have run into another problem just recently: I have no clue how to make a PHP crawler keep following links. Is there actually any way to do this in the first place? Here is what I have so far.


Quote:
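// Prints every absolute http:// link found on the page at $url.
function search() {
    global $url;
    $data = file_get_contents($url);
    if ($data === false) {
        return; // the page could not be fetched
    }
    // $links[0] holds every full match of the pattern.
    if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) {
        foreach ($links[0] as $link) {
            echo "<font size=\"2\" face=\"verdana\">" . $link . "</font><br>";
        }
    }
}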
demoIXI is offline
Old 08-02-2004, 03:30 PM
Can anyone here please help me out? I have been searching for a long time for a way to do it, but I still can't find one.
demoIXI is offline
Old 08-02-2004, 04:23 PM
Kyrnt's Avatar
The Post-Mod Years

Posts: 2,535
Location: Western Maryland
Trades: 0
demo,

I want to respond just to assure you that people are reading your thread. I don't personally know how to write a "crawler." I didn't test your code, just looked at it, and it seems that it only prints out the links that appear on the page. What in your code actually causes each of those links to be crawled? Are you calling the search() function on each of the links in your array?
__________________
—Kyrnt
Old 08-04-2004, 12:54 PM
That's the problem. I have no idea how to call the other links. I have been searching for a long time and still have had no luck.
demoIXI is offline
Old 08-04-2004, 01:08 PM
Kyrnt's Avatar
How about calling file_get_contents() on each of the harvested links? Wouldn't that give you the same functionality as visiting the original URL?
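As a rough sketch of that idea (not tested against real sites): keep a queue of links still to visit, remember which ones have already been fetched, and stop after a fixed number of hops so the crawl cannot run forever. The names crawl(), $maxDepth and $fetch are made up for this example; $fetch is only there so the page loader can be swapped out, and by default it just wraps file_get_contents(). It reuses the same pattern as search(), so it only follows absolute http:// links.

```php
<?php
// Hypothetical sketch: follow links breadth-first, at most $maxDepth hops
// away from $startUrl. Returns the list of URLs that were visited.
// $fetch is any callable returning a page body (or false) for a URL.
function crawl(string $startUrl, int $maxDepth, ?callable $fetch = null): array
{
    $fetch = $fetch ?? function (string $u) {
        return @file_get_contents($u);
    };
    $visited = [];                 // URLs already fetched, used as a set
    $queue = [[$startUrl, 0]];     // pairs of [url, depth]

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        if (isset($visited[$url])) {
            continue;              // skip pages we have already seen
        }
        $visited[$url] = true;

        $data = $fetch($url);
        if ($data === false || $depth >= $maxDepth) {
            continue;              // fetch failed, or do not go any deeper
        }
        // Same pattern as search(): absolute http:// links only.
        if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) {
            foreach ($links[0] as $link) {
                $queue[] = [$link, $depth + 1];
            }
        }
    }
    return array_keys($visited);
}
```

Something like crawl($url, 2) would then visit the starting page, the pages it links to, and the pages those link to. The visited list matters: without it, two pages that link to each other would keep the loop going indefinitely.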
__________________
—Kyrnt
Old 08-04-2004, 03:09 PM
I tried that and it does not work. Do you have any other ideas?
demoIXI is offline
Old 08-04-2004, 05:21 PM
Kyrnt's Avatar
Quote:
Originally Posted by demoIXI
I tried that and it does not work. Do you have any other ideas?
Well, does it work for the original URL? For example, if it can successfully pull the contents from one URL (e.g., http://www.cnn.com), then it should be equally effective at crawling the links harvested from that page.

You know what's probably happening? You are probably harvesting relative URLs. I'm sure of it.

For example, when you point your program at http://www.somewhere.com and harvest links off that page, the URLs are most likely relative URLs (e.g., href="FAQs.html", href="Contact.php", etc.). If you then pass just that value to your function, it will fail because the URL is not fully qualified. You need to add some logic to your code to check whether the URL starts with "http://". If it doesn't, take the current URL, remove the final document element (e.g., "http://www.somewhere.com/Information/index.php" becomes "http://www.somewhere.com/Information/"), append the relative URL to it ("http://www.somewhere.com/Information/FAQs.html"), and pass the result to file_get_contents().
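Roughly, that logic could look like the sketch below. qualify_url() is a made-up name, and the handling of links that start with "/" is an extra assumption on top of the steps described above. It ignores ports, query strings, and "../" segments, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: turn a harvested href into a fully qualified URL,
// using the URL of the page it was found on.
function qualify_url(string $currentUrl, string $href): string
{
    // Already absolute: use it as-is.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($currentUrl);
    $root = $parts['scheme'] . '://' . $parts['host'];

    // Assumption: a link starting with "/" is relative to the site root.
    if (substr($href, 0, 1) === '/') {
        return $root . $href;
    }

    // Remove the final document element, keeping the trailing slash.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $href;
}
```

Calling qualify_url("http://www.somewhere.com/Information/index.php", "FAQs.html") gives "http://www.somewhere.com/Information/FAQs.html", which can then be handed to file_get_contents().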
__________________
—Kyrnt
Old 08-04-2004, 05:52 PM
That's not the problem, dude. The problem is that I have no clue how to make a search crawler search deeper than just the first set of links.

My question is: how can you make a bot harvest deeper than the first link that you supply it with?
demoIXI is offline