Two Ways to Scrape Web Page Data with Python

# -*- coding:gbk -*-
"""
Created on Mon Mar 6 21:37:43 2023

@author: Administrator
"""
"""
Two ways to scrape web page data with Python
https://www.worldometers.info/coronavirus/
"""

import pandas as pd  # import the pandas library
import jieba
import jieba.analyse
import re
from urllib.request import Request, urlopen

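"""
1. Use urllib to fetch web page data and write it to an Excel (.xls) file
"""
import urllib.request  # import the urllib library

url = urllib.request.urlopen("https://industry.cfi.cn/BCA0A4127A4128A4138.html")  # the site to scrape
data = url.read()  # raw response bytes (the page's HTML)
dt1 = open("D:/wst/2.xls", "wb")  # location of the .xls file; it is created automatically
dt1.write(data)  # write the raw HTML bytes to D:/wst/2.xls (not a native Excel workbook)
dt1.close()
print(data)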
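The download step above can also be wrapped in a small reusable function. The following is a minimal sketch, not the original author's code: the function name `fetch_to_file`, the default `User-Agent` string, the 10-second timeout, and the `D:/wst/2.html` output path in the commented call are all assumptions chosen for illustration.

```python
import urllib.request


def fetch_to_file(url, path, user_agent="Mozilla/5.0", timeout=10):
    """Download url and save the raw response bytes to path; return the byte count."""
    # Hypothetical helper: sends a User-Agent header, since some sites reject
    # requests that use urllib's default one.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # The context managers close the connection and the file even if an error occurs.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = resp.read()
    with open(path, "wb") as out:
        out.write(data)
    return len(data)


# Example call (performs a network request, so it is left commented out):
# fetch_to_file("https://industry.cfi.cn/BCA0A4127A4128A4138.html", "D:/wst/2.html")
```

Saving with an `.html` extension makes it explicit that the file holds the page's HTML; the `.xls` name in the script does not change the bytes that are written.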

"""
2. Use pandas to scrape web page data
"""

"""
html = "https://www.if18.vip/weibo/archives/351.html"  # paste the URL of the site to scrape here
date = pd.read_html(html)  # read the page's tables with pd.read_html
#date = pd.read_html(html, encoding="utf-8")[0]
#print(date)
print(type(date))
#print(date[4])  # print the scraped data
#print(type(date[4]))  # print the type of the scraped data
#print(date[2])

#datew = ",".join(date)
#wf = open('13.txt', 'w+')
#wf.write(date)
#wf.close()
"""
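`pd.read_html` returns a list with one DataFrame per `<table>` element it finds, which is why `type(date)` prints a list type. To make that shape concrete, here is a standard-library sketch that collects tables as nested lists. It is not how pandas is implemented: the names `TableExtractor` and `extract_tables` are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _finish_cell(self):
        # Store the current cell's text in the current row, if both exist.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._finish_cell()  # tolerate a missing </td> before the next cell
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._finish_cell()
        elif tag == "tr" and self._row is not None:
            self._finish_cell()
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return a list of tables found in the html string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


page = "<p>intro</p><table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td> 2 </td></tr></table>"
print(extract_tables(page))
```

Indexing the result with `[0]` selects the first table, in the same way that `date[2]` in the script selects the third DataFrame.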

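"""
print("The above is the output of date, which is a list")
# The following was adapted by sam from "python实现简单中文词频统计示例.py".
keywords = jieba.analyse.extract_tags(date[2].iat[0, 0], withWeight=True)
# Inspect the extraction result
print(keywords)
# Visit each element of the keywords list
for item in keywords:
    # Each item is a (keyword, weight) pair
    print(item[0], item[1])
"""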
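The script the block above was adapted from is described as a simple word-frequency example. A plain frequency count is a different measure from the TF-IDF weights that `jieba.analyse.extract_tags` uses by default, and it can be sketched with the standard library alone. The function name `top_words` and its parameters are assumptions for illustration; for Chinese text, `jieba.lcut` could be passed as the `tokenize` argument in place of the whitespace split used here.

```python
from collections import Counter


def top_words(text, tokenize=str.split, n=5):
    """Return the n most frequent tokens in text as (word, count) pairs."""
    # Skip single-character tokens, which are rarely useful as keywords.
    words = [w for w in tokenize(text) if len(w) > 1]
    return Counter(words).most_common(n)


for word, count in top_words("data web data page web data"):
    print(word, count)
```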

"""
----------------
Copyright notice: This is an original article by CSDN blogger 「菇毒」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Two ways to scrape web page data with Python
https://passport.gitcode.net/cross?ticket=26b10ad5-4f86-4e6d-9023-1df2b4451090&redirectUrl=https%3A%2F%2Fblog.csdn.net%2Fweixin_43960383%2Farticle%2Fdetails%2F120103913
Original link: https://blog.csdn.net/weixin_43960383/article/details/120103913
"""

"""
ValueError: No tables found matching pattern '.+' when using pd.read_html()
https://www.pythonheidong.com/blog/article/443991/4ac303af4ffb24887d80/
"""
"""
req = Request('https://www.worldometers.info/coronavirus/', headers={'User-Agent': 'Firefox/76.0.1'})

webpage = re.sub(r'<.*?>', lambda g: g.group(0).upper(), urlopen(req).read().decode('utf-8'))
tables = pd.read_html(webpage)
print(tables)
"""
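The `re.sub` call in the workaround above upper-cases every tag before the page is handed to `pd.read_html`. The sketch below isolates that one step on a small made-up HTML string so its effect is visible without a network request. The helper name `uppercase_tags` and the sample markup are assumptions, and the source does not explain why this helps on that particular page.

```python
import re


def uppercase_tags(html):
    """Upper-case everything between each '<' and the next '>' (non-greedy match)."""
    return re.sub(r"<.*?>", lambda m: m.group(0).upper(), html)


sample = '<table id="main"><tr><td>cases</td></tr></table>'
print(uppercase_tags(sample))
# prints: <TABLE ID="MAIN"><TR><TD>cases</TD></TR></TABLE>
```

The text between tags is left alone, but attribute names and values inside a tag are upper-cased along with the tag name. Because `.` does not match a newline by default, a tag that spans several lines is not changed.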
