Recursive division method

Mazes can be created with recursive division, an algorithm which works as follows: Begin with the maze’s space with no walls. Call this a chamber. Divide the chamber with a randomly positioned wall (or multiple walls) where each wall contains a randomly positioned passage opening within it. Then recursively repeat the process on the subchambers until all chambers are minimum sized. This method results in mazes with long straight walls crossing their space, making it easier to see which areas to avoid.

For example, in a rectangular maze, build at random points two walls that are perpendicular to each other. These two walls divide the large chamber into four smaller chambers separated by four walls. Choose three of the four walls at random, and open a one cell-wide hole at a random point in each of the three. Continue in this manner recursively, until every chamber has a width of one cell in either of the two directions.
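The procedure above can be sketched in C#. This is the common single-wall-per-step variant (each division adds one wall with one gap, rather than the two perpendicular walls of the example); the class and member names are my own, not from any particular library:

```csharp
using System;

// Recursive division on a width x height cell grid.
// Horizontal[y, x] is the wall above cell (x, y); Vertical[y, x] is the wall
// to the left of cell (x, y).
public class RecursiveDivisionMaze
{
    public bool[,] Horizontal { get; }   // (height + 1) x width
    public bool[,] Vertical { get; }     // height x (width + 1)
    private readonly Random rng = new Random();

    public RecursiveDivisionMaze(int width, int height)
    {
        Horizontal = new bool[height + 1, width];
        Vertical = new bool[height, width + 1];
        for (int x = 0; x < width; x++)                  // outer border walls
            Horizontal[0, x] = Horizontal[height, x] = true;
        for (int y = 0; y < height; y++)
            Vertical[y, 0] = Vertical[y, width] = true;
        Divide(0, 0, width, height);
    }

    // Split the chamber [x, x+w) x [y, y+h) with one randomly positioned wall
    // containing one randomly positioned gap, then recurse on both halves.
    private void Divide(int x, int y, int w, int h)
    {
        if (w < 2 || h < 2) return;                      // minimum-sized chamber
        bool cutHorizontally = h > w || (h == w && rng.Next(2) == 0);
        if (cutHorizontally)
        {
            int wallY = y + 1 + rng.Next(h - 1);         // random wall position
            int gapX = x + rng.Next(w);                  // random passage opening
            for (int i = x; i < x + w; i++)
                if (i != gapX) Horizontal[wallY, i] = true;
            Divide(x, y, w, wallY - y);
            Divide(x, wallY, w, y + h - wallY);
        }
        else
        {
            int wallX = x + 1 + rng.Next(w - 1);
            int gapY = y + rng.Next(h);
            for (int j = y; j < y + h; j++)
                if (j != gapY) Vertical[j, wallX] = true;
            Divide(x, y, wallX - x, h);
            Divide(wallX, y, x + w - wallX, h);
        }
    }
}
```

Because every wall carries exactly one opening, each division keeps the two subchambers connected, so the finished maze is perfect: every cell is reachable from every other.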


Url Seen performs URL deduplication. A large crawler system may already hold tens or even hundreds of billions of URLs, so when a new URL arrives, deciding quickly whether it has been seen before is critical. A large crawler may download several thousand pages per second, a page typically yields dozens of URLs, and every one of them must go through deduplication, so an enormous number of membership checks run every second. Url Seen is therefore one of the most technically demanding parts of the whole crawler.
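At that scale an exact hash set of every URL will not fit in memory, so Url Seen is commonly built on a Bloom filter, which answers "possibly seen" or "definitely new" with a handful of bit probes. A minimal sketch follows; the names (`BloomUrlSeen`, `TestAndAdd`) and the FNV-1a second hash are my own illustration, not code from this post:

```csharp
using System;
using System.Collections;

// Minimal Bloom-filter sketch for URL deduplication (illustrative, not the
// post's actual implementation). False positives are possible; false
// negatives are not.
public class BloomUrlSeen
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomUrlSeen(int capacityBits, int hashCount)
    {
        bits = new BitArray(capacityBits);
        this.hashCount = hashCount;
    }

    // Returns true if the url may have been seen before; returns false only
    // when the url is definitely new. Marks the url as seen either way.
    public bool TestAndAdd(string url)
    {
        int h1 = url.GetHashCode() & 0x7fffffff;
        int h2 = Fnv1a(url) & 0x7fffffff;
        bool seen = true;
        for (int i = 0; i < hashCount; i++)
        {
            // double hashing: derive k probe positions from two base hashes
            int idx = (int)(((long)h1 + (long)i * h2) % bits.Length);
            if (!bits[idx]) { seen = false; bits[idx] = true; }
        }
        return seen;
    }

    // FNV-1a string hash, used here as an independent second hash function.
    private static int Fnv1a(string s)
    {
        unchecked
        {
            int hash = (int)2166136261;
            foreach (char c in s) { hash ^= c; hash *= 16777619; }
            return hash;
        }
    }
}
```

In a production crawler the bit array would be sharded across machines or backed by an external store; accepting a small false-positive rate (a few fresh URLs wrongly skipped) is the price for fitting billions of URLs in memory.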


Url Filter applies a further screening step to the extracted URLs. The criteria differ by application: a general search engine such as Baidu or Google usually does little or no filtering, while a vertical-search or focused crawler may only want URLs that satisfy some condition, for example skipping image URLs, or keeping only URLs from one particular site. Url Filter is therefore a module closely tied to the application.

using System;
using System.Collections.Generic;
using Crawler.Common;

namespace Crawler.Processing
{
    public class UrlFilter
    {
        // Drop every url that matches any of the given patterns.
        public static List<Uri> RemoveByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>(uris);
            for (var i = uriList.Count - 1; i >= 0; i--)
            {
                foreach (var r in regexs)
                {
                    if (!RegexHelper.IsMatch(uriList[i].ToString(), r)) continue;
                    uriList.RemoveAt(i); // remove from the copy, not the input list
                    break;               // this url is gone; stop testing patterns
                }
            }
            return uriList;
        }

        // Keep only the urls that match at least one of the given patterns.
        public static List<Uri> SelectByRegex(List<Uri> uris, params string[] regexs)
        {
            var uriList = new List<Uri>();
            foreach (var t in uris)
                foreach (var r in regexs)
                    if (RegexHelper.IsMatch(t.ToString(), r) && !uriList.Contains(t))
                        uriList.Add(t);
            return uriList;
        }
    }
}

The Extractor's job is to pull every URL out of a downloaded page. This is detail-oriented work: you have to account for all the forms a URL can take; for example, pages often contain relative-path URLs, which must be converted to absolute paths during extraction. Here we choose regular expressions to do the link extraction.

Link addresses in HTML tags usually appear in the href attribute or the src attribute, so we use two regular expressions to match all the link addresses in a page.
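The two-regex approach can be sketched like this. The patterns and the `LinkExtractor` name are illustrative, not the post's actual code; relative paths are resolved against the page's own URI, and URL fragments (everything after `#`) are dropped:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative sketch of extracting links from href and src attributes
// with two regular expressions.
public static class LinkExtractor
{
    // one pattern for href="..." attributes, one for src="..." attributes;
    // group 1 captures the quoted value up to a quote or fragment marker
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*[\"']([^\"'#]+)", RegexOptions.IgnoreCase);
    private static readonly Regex SrcPattern =
        new Regex("src\\s*=\\s*[\"']([^\"'#]+)", RegexOptions.IgnoreCase);

    public static List<Uri> Extract(string html, Uri baseUri)
    {
        var result = new List<Uri>();
        foreach (var pattern in new[] { HrefPattern, SrcPattern })
            foreach (Match m in pattern.Matches(html))
                // Uri(baseUri, relative) resolves relative paths to absolute urls
                if (Uri.TryCreate(baseUri, m.Groups[1].Value, out var abs)
                    && !result.Contains(abs))
                    result.Add(abs);
        return result;
    }
}
```

Regexes are a pragmatic choice here, but they will miss unquoted attributes and pick up links inside comments; a real HTML parser handles those edge cases at the cost of speed.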


Recently .NET Core was updated to 1.0.1 and Azure tools to 2.9.5. When I tried to update, the .NET Core update failed with "0x80072f8a, unspecified error". Azure Tools bundles the .NET Core update as well, so the same 0x80072f8a problem kept both packages from updating successfully.

Digging into the installation error log revealed the cause: an expired certificate was blocking the download of Microsoft's online resources, so the installation could not succeed. Once the certificate problem was fixed, the installation completed without any further trouble!
