Application and Research of Web Crawler Technology for Extracting Web Page Information

Source: doc163.com  Resource ID: DC26466

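Application and Research of Web Crawler Technology for Extracting Web Page Information (task statement, proposal report, thesis of 22,000 characters, reference code)
Abstract (Chinese original, translated)
    Over the past few decades, driven by users and engineers around the world, networks have grown rapidly, and the World Wide Web has become the carrier of online information. Many applications need to extract that information, for example search engines, news gathering, and public opinion monitoring. Locating the information a user wants within the huge Internet repository is therefore a direction for future research in search technology. This thesis studies one tool for acquiring web page information: the web crawler.
    The main framework of a web crawler consists of page acquisition, page storage, and index generation. It works as follows: starting from a given seed URL, the crawler establishes a connection between client and server, fetches the specified page, and then keeps extracting new URLs from the saved pages and inserting them into a queue. The crawler designed here provides page fetching, URL management, index generation, and page parsing, along with a human-computer interaction interface. Page fetching downloads pages with the WebClient class in C#. URL management builds a Todo table that stores visited links, so the crawler can tell whether the link it is handling has already been visited; it also uses regular expressions to extract URLs. Page parsing analyzes HTML tags to extract a page's body text, title, keywords, and URL links. Index generation calls the Lucene.Net class library (a full-text search development kit) and uses its IndexWriter to build the index. The whole crawler is written in C# in the VS2010 environment and stores its data in a SQL Server 2008 database, with insert, delete, and query operations on the stored data. The system is tested by supplying a URL and then checking how each function of the crawler behaves.
Keywords: web crawler; URL; page analysis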
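
The crawl loop described in the abstract (seed URL, queue of pending links, a record of visited links, regular-expression link extraction) can be sketched roughly as follows. This is a minimal illustration in Python rather than the thesis's C# code; the names `crawl`, `extract_links`, `HREF_RE`, and `FAKE_SITE` are hypothetical, the regular expression is a simplified assumption, and an in-memory dictionary stands in for real HTTP downloads.

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag

# Simplified, assumed pattern for quoted href attributes; not the thesis's regex.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in quoted href attributes."""
    links = []
    for raw in HREF_RE.findall(html):
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, raw.strip()))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    frontier = deque([seed])  # URLs waiting to be fetched
    visited = set()           # URLs already handled, checked before each fetch
    pages = {}                # url -> saved HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in visited:
                frontier.append(link)
    return pages


# In-memory stand-in for the network (hypothetical pages).
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/c#top">C</a>',
    "http://example.com/b.html": '<a HREF="/a">A again</a>',
    "http://example.com/c": "no links here",
}

result = crawl("http://example.com/", FAKE_SITE.get)
print(list(result))
```

Passing the fetch function as a parameter keeps the loop testable without network access; a real run would supply a function that downloads the page over HTTP, which is the role WebClient plays in the thesis.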

Abstract
In recent decades, driven by users and technical staff around the world, networks have developed rapidly, and the World Wide Web has become the carrier of online information. Many applications need to extract this information, such as search engines, news gathering, and public opinion monitoring. Locating the information a user needs within the vast Internet repository will therefore be a direction for future research in search technology. This thesis studies one tool for acquiring web page information: the web crawler.
The main framework of a web crawler comprises web page acquisition, web page storage, and index generation. A crawler works as follows: starting from a given seed URL, it establishes a connection between client and server, fetches the specified page, and then repeatedly extracts new URLs from the saved pages and inserts them into a queue. The crawler designed in this thesis provides web page fetching, URL management, index generation, and web page parsing, together with a human-computer interaction interface. Page fetching uses the WebClient class in C# to download pages. URL management creates a Todo table that stores visited links so that the crawler can determine whether the link it is processing has already been visited; it also uses regular expressions to extract URLs. Page parsing analyzes HTML tags to extract a page's body text, title, keywords, and URL links. Index generation uses the Lucene.Net class library (a full-text search development kit) and its IndexWriter to build the index. The whole crawler is implemented in C# in the VS2010 environment and uses a SQL Server 2008 database to store data, supporting insert, delete, and query operations on that data. The system is tested by supplying a URL and then checking how each function of the crawler behaves.
Keywords: web crawler; URL; page analysis
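
The page parsing step mentioned in the abstract (extracting a page's title, keywords, body text, and URL links from its HTML tags) might look roughly like the sketch below. It uses Python's standard `html.parser` module rather than the thesis's C# implementation; the class `PageParser`, the function `parse_page`, and the sample page are hypothetical, and reading keywords from a `<meta name="keywords">` tag is an assumption about where the keywords come from.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the title, meta keywords, links, and visible text of one page."""

    SKIP = {"script", "style"}  # element content that is not visible text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Parse one HTML document and return its extracted fields as a dict."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "keywords": parser.keywords,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }


SAMPLE = """<html><head><title>Crawler Demo</title>
<meta name="keywords" content="web crawler, URL, page analysis">
<style>body { color: red; }</style></head>
<body><h1>Hello</h1><p>See <a href="/next">the next page</a>.</p>
<script>var x = 1;</script></body></html>"""

info = parse_page(SAMPLE)
print(info["title"], info["keywords"], info["links"])
```

The extracted links are still relative here; a crawler would resolve them against the page URL before queueing them, and the title, keywords, and text are the fields that would be handed to the indexing step.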

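Contents
Abstract (Chinese)
Abstract (English)
Contents
Chapter 1. Introduction
1.1 Research background and significance
1.2 Development status of web crawlers in China and abroad
1.3 Research content of this thesis
1.3.1 Content studied in this thesis
1.3.2 Organization of this thesis
Chapter 2. Related theory and key technologies
2.1 How a web crawler works
2.2 The HTTP protocol
2.3 Regular expressions
2.4 The C# programming language
2.4.1 Overview of C#
2.4.2 Features of C#
2.4.3 C# theoretical background
2.5 Chapter summary
Chapter 3. Analysis and design of the web crawler system
3.1 Requirements analysis of the web crawler system
3.2 Functional design of the web crawler
3.3 Design of the main crawler functions
3.3.1 Web page fetching design
3.3.2 URL management
3.3.3 Crawling strategy design
3.3.4 Page parsing design
3.4 Chapter summary
Chapter 4. Implementation of the web crawler system
4.1 Development tools
4.2 Implementation of each part of the crawler
4.2.1 Web page fetching implementation
4.2.2 URL management implementation
4.2.3 Crawling strategy implementation
4.2.4 Web page parsing implementation
4.2.5 Crawler interface design and implementation
4.2.6 Index generation design
4.3 Thread management implementation
4.3.1 Advantages of multithreading
4.3.2 Disadvantages of multithreading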
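4.4 Chapter summary
Chapter 5. Testing of the web crawler system
5.1 Test environment
5.2 Test process
5.2.1 Single-crawler test
5.2.2 Multithreaded test
5.2.3 Page design test
5.3 Test results
5.4 Chapter summary
Chapter 6. Summary and outlook
6.1 Summary
6.2 Outlook
References
Appendix
Acknowledgements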
