Web Crawler Program: pageSpider

blessed24

Blog category:

Search Engine

Java thread .net

2009-05-05 19:44

This program fetches the page content for a single URL only (pageSpider.java). The program flowchart is as follows:

[Figure: program flowchart]

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;

public class pageSpider implements Runnable {

HttpURLConnection httpUrlConnection;
InputStream inputStream;
BufferedReader bufferedReader;
String url;

public pageSpider() {

url = "http://www.baidu.com"; // target page; a plain assignment cannot throw, so no try/catch is needed

    try {
     httpUrlConnection = (HttpURLConnection) new URL(url).openConnection(); // create the connection object (no network traffic yet)
    } catch (MalformedURLException e) {
     e.printStackTrace();
    } catch (IOException e) {
     e.printStackTrace();
    }

System.out.println("---------start-----------");

    Thread thread = new Thread(this);
    thread.start();
    try {thread.join();} catch (InterruptedException e) {e.printStackTrace();}

System.out.println("----------end------------");
}

public void run() {
    try {
     httpUrlConnection.setRequestMethod("GET");
    } catch (ProtocolException e) {
     e.printStackTrace();
    }

    try {
     httpUrlConnection.setUseCaches(true); // allow cached responses
     httpUrlConnection.connect();           // establish the connection
    } catch (IOException e) {
     e.printStackTrace();
    }

    try {
     inputStream = httpUrlConnection.getInputStream(); // get the response input stream
     bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "gb2312"));
     String string;
     while ((string = bufferedReader.readLine()) != null) {
        System.out.println(string); // print each line
     }
    } catch (IOException e) {
     e.printStackTrace();
    } finally {
     try {
      // either stream is still null if getInputStream() failed
      if (bufferedReader != null) bufferedReader.close();
      if (inputStream != null) inputStream.close();
     } catch (IOException e) {
      e.printStackTrace();
     }
     httpUrlConnection.disconnect();
    }
}

public static void main(String[] args) {
new pageSpider();
}

}
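The listing above hard-codes the "gb2312" encoding and closes its streams by hand. As a rough sketch only, the same single-page fetch could read the charset from the Content-Type response header and let try-with-resources close the reader. The class name PageFetchSketch, the helpers charsetFrom, readAll and fetch, and the 5-second timeouts are all assumptions made for illustration and are not part of the original pageSpider.java. The GB2312 fallback mirrors the original and assumes the JDK in use ships that charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original pageSpider.java.
public class PageFetchSketch {

    // Extract the charset named in a Content-Type value such as
    // "text/html; charset=gb2312"; return the fallback when none is usable.
    public static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // case-insensitive check for the "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    // Read a stream fully as text; try-with-resources closes the reader
    // (and the wrapped stream) even if readLine() throws.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Fetch one page over HTTP GET and return its body as text.
    public static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000); // assumed timeouts, in milliseconds
        conn.setReadTimeout(5000);
        try {
            // getContentType() connects implicitly if not yet connected
            Charset cs = charsetFrom(conn.getContentType(), Charset.forName("GB2312"));
            return readAll(conn.getInputStream(), cs);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String target = args.length > 0 ? args[0] : "http://www.baidu.com";
        System.out.print(fetch(target));
    }
}
```

Splitting the charset parsing and the stream reading into separate methods lets both be exercised without a network connection, for example by passing a ByteArrayInputStream to readAll.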

