Request objects
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
A Request object represents one HTTP request. It is generated in a spider and, after being executed by the Downloader, produces a Response.
#### Using callback
```python
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
```
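The signature above also accepts an errback, a callable invoked if the request fails with an exception. A minimal sketch of wiring one up; the spider and handler names here are made up for illustration:

```python
import scrapy


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"  # hypothetical spider name

    def start_requests(self):
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page,
                             errback=self.on_error)

    def parse_page(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # the errback receives a twisted.python.failure.Failure
        self.logger.error("Request failed: %s", repr(failure))
```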
Parameters
- url: the URL of this request (the only required argument)
- callback: if not specified, the spider's parse() method is used by default
- method: the HTTP method, defaults to GET
- meta: (dict) arbitrary metadata for this request; it is copied onto the resulting response (see the sketch after the cookie examples)
- body: the request body
- headers: the headers of this request
- cookies: can be passed in two forms, shown below
Examples:

1. Using a dict:

```python
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
```

2. Using a list of dicts (this form also lets you set the cookie's domain and path):

```python
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
```
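To illustrate the meta dict from the parameter list above: since it is copied onto the resulting response, it is the standard way to carry data from one callback to the next. A minimal sketch in the same style as the callback example earlier; the 'item' key is an arbitrary name of ours:

```python
def parse_page1(self, response):
    item = {'main_url': response.url}
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item  # stash data for the next callback
    return request

def parse_page2(self, response):
    item = response.meta['item']  # meta was copied onto this response
    item['other_url'] = response.url
    return item
```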
FormRequest
class scrapy.http.FormRequest(url[, formdata, ...])
Examples:

- Sending data via HTTP POST with FormRequest:

```python
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
```
- Simulating a login with FormRequest.from_response, which pre-populates the request with the form fields found in the page (including hidden ones), so formdata only needs to supply or override the fields you care about:
```python
# -*- coding: utf-8 -*-
import scrapy
from logging import warning


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    # allowed_domains = ['www.zhaopin.com']
    start_urls = ['https://passport.zhaopin.com/account/login']

    def parse(self, response):
        warning("{}".format(response.status))
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'int_count': '999',
                'errUrl': "https://passport.zhaopin.com/account/login",
                'RememberMe': 'false',
                'requestFrom': 'portal',
                'loginname': '925370765@qq.com',
                'Password': 'xxx',
            },
            callback=self.after_login
        )

    def after_login(self, response):
        self.logger.warning("{}".format(response.url))
        # A successful login redirects to https://i.zhaopin.com,
        # so use the final URL to decide whether the login succeeded
        if response.url == "https://i.zhaopin.com":
            self.logger.warning("Login success")
        else:
            self.logger.error("Login failed")
```
HtmlResponse
class scrapy.http.HtmlResponse(url[, ...])
Example session:
```python
In [1]: from scrapy.http import HtmlResponse

In [2]: from scrapy.selector import Selector

In [3]: body = """
   ...: <html>
   ...:  <head>
   ...:   <base href='http://example.com/' />
   ...:   <title>Example website</title>
   ...:  </head>
   ...:  <body>
   ...:   <div id='images'>
   ...:    <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   ...:    <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   ...:    <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   ...:    <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   ...:    <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   ...:   </div>
   ...:  </body>
   ...: </html>
   ...: """

In [4]: url = "https://doc.scrapy.org/en/latest/_static/selectors-sample1.html"

In [5]: response = HtmlResponse(url=url, body=body, encoding="utf8")

In [6]: response
Out[6]: <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>

In [7]: response.xpath("//title")
Out[7]: [<Selector xpath='//title' data='<title>Example website</title>'>]
```
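Building an HtmlResponse by hand like this is also handy for exercising parsing logic offline, without running a crawl. A small sketch; the URL and markup below are made up:

```python
from scrapy.http import HtmlResponse

body = b"<html><body><a href='page2.html'>next</a></body></html>"
response = HtmlResponse(url="http://example.com/page1.html",
                        body=body, encoding="utf-8")
# selectors work on the hand-built response exactly as on a crawled one
assert response.xpath("//a/@href").extract() == ['page2.html']
```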
response.urljoin
This method is a wrapper around urlparse.urljoin (urllib.parse.urljoin in Python 3).
urlparse.urljoin(base, url[, allow_fragments])
- If the second argument is a relative path, it is merged with the base URL to produce a new URL:

```python
>>> from urlparse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
```
- If the second argument is an absolute URL, it is used directly:

```python
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
```
Usage: response.urljoin(url) works the same way, except that the base argument is omitted; it defaults to response.url. response.follow is a further wrapper over this method, used as follows:

```python
yield response.follow(next_page, callback=self.parse)
yield Request(response.urljoin(next_page), callback=self.parse)
```

The two lines above have the same effect; a fuller pagination sketch follows.
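For context, a sketch of the usual pagination pattern built on response.follow, using the selectors from the official Scrapy tutorial site (quotes.toscrape.com); the spider name is ours:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes_pagination"  # hypothetical name
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").extract():
            yield {"text": text}
        # next_page is a relative URL like '/page/2/';
        # response.follow resolves it against response.url internally
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```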