Document name: 计算机科学导论笔记哟.docx (Introduction to Computer Science notes)

Format: docx  Size: 31 KB  Pages: 17

Uploaded by: 2623466021  Date: 2018/9/22  File size: 31 KB

Extracting links
Extracting a single URL
page =('<div id="top_bin"><div id="top_content" class="width960">'
'<div class="udacity float-left"><a href="">')

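# Official code from the course:
start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
# Note: the search must start at start_quote + 1. Starting at start_quote would
# find the opening quote after href= again (the same result as the line above),
# so url would come out empty.
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]
print(url)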
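As a quick check of the slicing logic, here is a minimal sketch that runs the same three find calls on a small string. The name sample_page and its URL are made up for illustration and are not part of the original notes; the code is written so that it also runs on Python 3.

```python
# Hypothetical sample page; the URL is an illustrative assumption.
sample_page = '<div class="menu"><a href="http://example.com/home">Home</a></div>'

start_link = sample_page.find('<a href=')            # index where the link tag starts
start_quote = sample_page.find('"', start_link)      # opening quote of the URL
end_quote = sample_page.find('"', start_quote + 1)   # closing quote of the URL
url = sample_page[start_quote + 1:end_quote]
print(url)

# Searching from start_quote instead of start_quote + 1 finds the same quote
# again, so the slice between the two positions is empty.
wrong_end = sample_page.find('"', start_quote)
empty_url = sample_page[start_quote + 1:wrong_end]
print(repr(empty_url))
```

Because the second argument of find is the position where the search starts, the earlier quotes in class="menu" are skipped automatically.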
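# Extension
What if we want to extract all the URLs?
start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]
print(url)  # first URL
page = page[end_quote:]  # continue with the rest of the page
start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]
print(url)  # second URL
...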
To avoid this repetition, define a procedure.
Format of a procedure definition:
def <name>(<parameters>):
    <block>
    return <expression>, <expression>...
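To make the format concrete, here is a small sketch of a procedure that returns two values, which the caller unpacks into two names. The helper split_at is hypothetical and only illustrates the def / return pattern; it does not appear in the original notes.

```python
def split_at(text, sep):
    """Hypothetical helper: return the parts of text before and after the first sep."""
    pos = text.find(sep)
    if pos == -1:
        # Separator not found: everything is "before", nothing is "after".
        return text, ''
    return text[:pos], text[pos + len(sep):]

# The two returned expressions are assigned to two names in one statement.
head, tail = split_at('key=value', '=')
print(head)
print(tail)
```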
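def get_next_target(page):
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    # This return replaces the part marked in blue in the notes (print url and
    # page = page[end_quote:]). page itself is unchanged by the procedure, so
    # the caller only needs to know end_quote.
    return url, end_quote
# Assign the two values returned by the procedure (url and end_quote) to url and endpos (the end position).
url, endpos = get_next_target(page)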
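The sketch below shows how the returned end position can be used to find a second link by calling the procedure again on the rest of the page. The sample_page string and its URLs are assumptions made up for this example, and the procedure is restated so the block runs on its own.

```python
def get_next_target(page):
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

# Hypothetical sample page with two links (the URLs are illustrative assumptions).
sample_page = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

first_url, endpos = get_next_target(sample_page)
# Slice off everything up to the end of the first URL, then search again.
second_url, _ = get_next_target(sample_page[endpos:])
print(first_url)
print(second_url)
```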
def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]  # move past the link just found; otherwise the loop never ends
        else:
            break
print_all_links(page)
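def get_page(url):
    # Fetch the page source; return '' if anything goes wrong.
    try:
        try:
            from urllib.request import urlopen  # Python 3
        except ImportError:
            from urllib import urlopen  # Python 2, as in the original notes
        return urlopen(url).read().decode('utf-8', 'ignore')
    except Exception:
        return ''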
#get_page('https:///')
def print_all_links(page):
    while True:
        url, endpos = get_next_link(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break
print_all_links(get_page('https:///'))
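The same loop can be tried without network access. The following is a minimal sketch under stated assumptions: get_next_link_sketch and collect_all_links are hypothetical names, the sample page and its URLs are made up, and the links are collected in a list instead of being printed so that the result is easy to inspect.

```python
def get_next_link_sketch(page):
    """Hypothetical helper: return (url, end position), or (None, 0) if no link is left."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

def collect_all_links(page):
    """Collect every link URL in page, in order of appearance."""
    links = []
    while True:
        url, endpos = get_next_link_sketch(page)
        if not url:
            break
        links.append(url)
        page = page[endpos:]  # continue searching after the link just found
    return links

# Hypothetical sample page (the URLs are illustrative assumptions).
sample_page = ('<p><a href="http://example.com/a">A</a></p>'
               '<p><a href="http://example.com/b">B</a></p>')
links = collect_all_links(sample_page)
print(links)
```

Returning a list rather than printing inside the loop is a design choice for this sketch only; it keeps the search logic separate from the output.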
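def get_next_link(page):
    start_link = page.find('<a href=')
    if start_link == -1:  # covers the case where the page has no link
        return None, 0  # no link: return None for the url and 0 for end_quote (i.e. endpos)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)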