Web_Crawler: Scraping Info on the Top 100 Movies

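This morning I saw a WeChat official account post a crawler for Maoyan's Top 100 movies. I skimmed it and modified the source so the data is stored in MongoDB, which makes it easy to pull into Jupyter later for data visualization and analysis: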

==============================

import requests 
import pymongo
import re
from requests.exceptions import RequestException
from multiprocessing import Pool

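These five imports are all we need. In order, they are used to request pages, connect to MongoDB, build the regex pattern with re.compile, catch request exceptions, and run a pool of worker processes.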

Setting up the database: webCrawler_CatEYE.p

client = pymongo.MongoClient('localhost', 27017)
cateye = client['cateye']
content = cateye['content']

webCrawler_CatEYE.py


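# Request headers; this User-Agent value is only an assumed placeholder
headers = {'User-Agent': 'Mozilla/5.0'}

# Fetch one page and return its HTML, or None on failure
def get_one_page(url):
    try:
        res = requests.get(url, headers=headers)
        if res.status_code == 200:
            return res.text
        return None
    except RequestException:
        return None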

# Parse one page of HTML with a regex and yield one dict per movie
def parse_one_page(html):
    # Raw strings keep the backslash in \d intact
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # drop the 3-character "主演：" label
            'time': item[4].strip()[5:],    # drop the 5-character "上映时间：" label
            'score': item[5] + item[6]      # integer part + fractional part
        }

# Storing the data from pages (into MongoDB)
def write_to_mongoDB(item):
    # item is already a dict, so it can be inserted as-is
    content.insert_one(item)

# Main function: fetch, parse, print and store one page
def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed, nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_mongoDB(item)


if __name__ == '__main__':
    p = Pool()
    # Offsets 0, 10, ..., 90: one per page of ten movies
    p.map(main, [i * 10 for i in range(10)])


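A quick walkthrough: the first function requests one page with requests.get, wrapped in a try so that a request exception returns None instead of crashing the crawler.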

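Next comes parsing the page. BeautifulSoup is the more common choice, but here we match with a compiled re pattern instead. Because the function uses yield, it is a generator that hands back one dict per movie rather than building a list. With the database set up earlier, each item goes in with the collection's insert_one method, nice and smooth.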
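To make the parsing step concrete, here is a minimal sketch that runs the same pattern against a small snippet. SAMPLE_HTML is hand-written and hypothetical, shaped only to match the pattern (it is not copied from the live site), and the parse function name is made up for this example. It also shows why the slices are [3:] and [5:]: they cut off the "主演：" and "上映时间：" labels.

```python
import re

# Same pattern as parse_one_page, written as raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Hypothetical snippet (an assumption for illustration, not real site output).
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Sample Movie">Sample Movie</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse(html):
    """Yield one dict per <dd> entry; nothing runs until the caller iterates."""
    for match in PATTERN.finditer(html):
        index, image, title, star, release, integer, fraction = match.groups()
        yield {
            'index': index,
            'image': image,
            'title': title,
            'actor': star.strip()[3:],    # strip the "主演：" label (3 characters)
            'time': release.strip()[5:],  # strip the "上映时间：" label (5 characters)
            'score': integer + fraction,  # "9." + "5"
        }


movies = parse(SAMPLE_HTML)  # a generator object, no parsing done yet
first = next(movies)         # parsing happens here, one item at a time
print(first)
```

Since each yielded item is a plain dict, it can be passed straight to insert_one without any conversion.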

The main function handles one page at a time: it prints each item and writes it to the database.

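Finally, everything runs across multiple processes. Called with no arguments, Pool starts as many worker processes as the system reports CPUs, so there is no need to look up the core count and hard-code it, which is handy. Then map main over the page offsets and you're done.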
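The Pool and map step can be tried without touching the network. In this rough sketch, page_offsets and page_url are hypothetical helpers introduced only for the example; page_url stands in for main and just builds the URL for each offset.

```python
import os
from multiprocessing import Pool

BASE_URL = 'http://maoyan.com/board/4?offset='


def page_offsets(pages=10, page_size=10):
    """Offsets for each page: 0, 10, ..., 90 with the defaults."""
    return [i * page_size for i in range(pages)]


def page_url(offset):
    """Stand-in worker (assumption): builds the URL only, makes no request."""
    return BASE_URL + str(offset)


if __name__ == '__main__':
    # With no argument, Pool sizes itself from the CPU count the OS reports.
    print('CPUs reported:', os.cpu_count())
    with Pool() as pool:
        # map splits the offsets across workers and returns results in input order.
        urls = pool.map(page_url, page_offsets())
    print(urls[0])
    print(urls[-1])
```

The `if __name__ == '__main__':` guard matters here: on platforms that start workers by re-importing the script, it keeps each worker from trying to create its own pool.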

Output

screenshot

Database

screenshot

Next I'd like to crawl 1,000 movies by genre, visualize the data, and draw some fancy charts. That's it.

BTW, at a quick glance, I've seen all 100 of these.
