Metadata-Version: 2.1
Name: gerapy-playwright
Version: 0.2.4
Summary: Playwright Components for Scrapy & Gerapy
Home-page: https://github.com/Gerapy/GerapyPlaywright
Author: Germey
Author-email: cqc@cuiqingcai.com
License: MIT
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown


# Gerapy Playwright

This is a package for supporting Playwright in Scrapy, also this
package is a module in [Gerapy](https://github.com/Gerapy/Gerapy).

## Note

**This package does not work on Windows**, please check this [issue](https://github.com/Gerapy/GerapyPlaywright/issues/2), please use this Package on Linux/Mac currently.

## Installation

```shell script
pip3 install gerapy-playwright
playwright install
```

## Usage

You can use `PlaywrightRequest` to specify a request which uses playwright to render.

For example:

```python
yield PlaywrightRequest(detail_url, callback=self.parse_detail)
```

And you also need to enable `PlaywrightMiddleware` in `DOWNLOADER_MIDDLEWARES`:

```python
DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}
```

Congratulate, you've finished the all of the required configuration.

If you run the Spider again, Playwright will be started to render every
web page which you configured the request as PlaywrightRequest.

## Settings

GerapyPlaywright provides some optional settings.

### Concurrency

You can directly use Scrapy's setting to set Concurrency of Playwright,
for example:

```python
CONCURRENT_REQUESTS = 3
```

### Pretend as Real Browser

Some website will detect WebDriver or Headless, GerapyPlaywright can
pretend Chromium by inject scripts. This is enabled by default.

You can close it if website does not detect WebDriver to speed up:

```python
GERAPY_PLAYWRIGHT_PRETEND = False
```

Also you can use `pretend` attribute in `PlaywrightRequest` to overwrite this
configuration.

### Logging Level

By default, Playwright will log all the debug messages, so GerapyPlaywright
configured the logging level of Playwright to WARNING.

If you want to see more logs from Playwright, you can change the this setting:

```python
import logging
GERAPY_PLAYWRIGHT_LOGGING_LEVEL = logging.DEBUG
```

### Download Timeout

Playwright may take some time to render the required web page, you can also change this setting, default is `30s`:

```python
# playwright timeout
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30
```

### Headless

By default, Playwright is running in `Headless` mode, you can also
change it to `False` as you need, default is `True`:

```python
GERAPY_PLAYWRIGHT_HEADLESS = False
```

### Window Size

You can also set the width and height of Playwright window:

```python
GERAPY_PLAYWRIGHT_WINDOW_WIDTH = 1400
GERAPY_PLAYWRIGHT_WINDOW_HEIGHT = 700
```

Default is 1400, 700.

### Proxy

You can set a proxy channel via below this config:

```python
GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
  'username': 'xxx',
  'password': 'xxxx'
}
```

### Screenshot

You can get screenshot of loaded page, you can pass `screenshot` args to `PlaywrightRequest` as dict:

Below are the supported args:

- `type` (str): Specify screenshot type, can be either `jpeg` or `png`. Defaults to `png`.
- `quality` (int): The quality of the image, between 0-100. Not applicable to `png` image.
- `full_page` (bool): When true, take a screenshot of the full scrollable page. Defaults to `False`.
- `clip` (dict): An object which specifies clipping region of the page. This option should have the following fields:
  - `x` (int): x-coordinate of top-left corner of clip area.
  - `y` (int): y-coordinate of top-left corner of clip area.
  - `width` (int): width of clipping area.
  - `height` (int): height of clipping area.
- `omit_background` (bool): Hide default white background and allow capturing screenshot with transparency.
- `timeout` (str): Maximum time in milliseconds, defaults to 30 seconds, pass 0 to disable timeout.

Check more from [https://playwright.dev/python/docs/api/class-page#page-screenshot](https://playwright.dev/python/docs/api/class-page#page-screenshot)。

For example:

```python
yield PlaywrightRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
            'type': 'png',
            'full_page': True
        })
```

then you can get screenshot result in `response.meta['screenshot']`:

Simplest save it to file:

```python
def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        f.write(response.meta['screenshot'].getbuffer())
```

If you want to enable screenshot for all requests, you can configure it by `GERAPY_PLAYWRIGHT_SCREENSHOT`.

For example:

```python
GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True
}
```

## PlaywrightRequest

`PlaywrightRequest` provide args which can override global settings above.

- url: request url
- callback: callback
- wait_until: one of "load", "domcontentloaded", "networkidle"
  see [https://playwright.dev/python/docs/api/class-page#page-wait-for-load-state](https://playwright.dev/python/docs/api/class-page#page-wait-for-load-state), default is `domcontentloaded`
- wait_for: wait for some element to load, also supports dict
- script: script to execute
- actions: actions defined for execution of Page object
- proxy: use proxy for this time, like `http://x.x.x.x:x`
- proxy_credential: the proxy credential, like `{'username': 'xxxx', 'password': 'xxxx'}`
- sleep: time to sleep after loaded, override `GERAPY_PLAYWRIGHT_SLEEP`
- timeout: load timeout, override `GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT`
- ignore_resource_types: ignored resource types, override `GERAPY_PLAYWRIGHT_IGNORE_RESOURCE_TYPES`
- pretend: pretend as normal browser, override `GERAPY_PLAYWRIGHT_PRETEND`
- screenshot: ignored resource types, see [https://playwright.dev/python/docs/api/class-page#page-screenshot](https://playwright.dev/python/docs/api/class-page#page-screenshot),
  override `GERAPY_PLAYWRIGHT_SCREENSHOT`

For example, you can configure PlaywrightRequest as:

```python
from gerapy_playwright import PlaywrightRequest

def parse(self, response):
    yield PlaywrightRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)
```

Then Playwright will:

- wait for document to load
- wait for title to load
- execute `console.log(document)` script
- sleep for 2s
- return the rendered web page content, get from `response.meta['screenshot']`
- return the script executed result, get from `response.meta['script_result']`

For waiting mechanism controlled by JavaScript, you can use await in `script`, for example:

```python
js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}
'''
yield PlaywrightRequest(url, callback=self.parse, script=js)
```

Then you can get the script result from `response.meta['script_result']`, result is `{'name': 'Germey'}`.

If you think the JavaScript is wired to write, you can use actions argument to define a function to execute `Python` based functions, for example:

```python
async def execute_actions(page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1
yield PlaywrightRequest(url, callback=self.parse, actions=execute_actions)
```

Then you can get the actions result from `response.meta['actions_result']`, result is `1`.

Also you can define proxy and proxy_credential for each Reqest, for example:

```python
yield PlaywrightRequest(
  self.base_url,
  callback=self.parse_index,
  priority=10,
  proxy='http://tps254.kdlapi.com:15818',
  proxy_credential={
      'username': 'xxxx',
      'password': 'xxxx'
})
```

`proxy` and `proxy_credential` will override the settings `GERAPY_PLAYWRIGHT_PROXY` and `GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL`.

## Example

For more detail, please see [example](./example).

Also you can directly run with Docker:

```shell
docker run germey/gerapy-playwright-example
```

Outputs:

```shell script
2021-12-27 16:54:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2021-12-27 16:54:14 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (default, Aug 31 2020, 07:22:35) - [Clang 10.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Darwin-21.1.0-x86_64-i386-64bit
2021-12-27 16:54:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2021-12-27 16:54:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 1,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet Password: e931b241390ad06a
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 16:54:14 [scrapy.core.engine] INFO: Spider opened
2021-12-27 16:54:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 16:54:14 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/1
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/1>
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: set options {'headless': False}
cookies []
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/1
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/1 with options {'url': 'https://antispider1.scrape.center/page/1', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/1> (referer: None)
2021-12-27 16:54:20 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/2
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/2>
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: set options {'headless': False}
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/1
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/2
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/3
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/4
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/5
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/6
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/7
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/8
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/9
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/10
cookies []
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/2
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/2 with options {'url': 'https://antispider1.scrape.center/page/2', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:23 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:24 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/2> (referer: None)
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/detail/10>
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: set options {'headless': False}
...
```


