Metadata-Version: 2.1
Name: spacypdfreader
Version: 0.1.0
Summary: A PDF to text extraction pipeline component for spaCy.
Home-page: https://github.com/SamEdwardes/spaCyPDFreader
License: MIT
Keywords: python,spacy,nlp,pdf,pdfs
Author: SamEdwardes
Author-email: edwardes.s@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: pdfminer.six (>=20201018,<20201019)
Requires-Dist: rich (>=10.2.2,<11.0.0)
Requires-Dist: spacy (>=3.0.6,<4.0.0)
Project-URL: Repository, https://github.com/SamEdwardes/spaCyPDFreader
Description-Content-Type: text/markdown

# spaCyPDFreader

Extract text from PDFs with spaCy and record each token's page number as a custom spaCy extension attribute (`token._.page_number`).

## Installation

```bash
pip install spacypdfreader
```

## Usage


```python
import spacy
from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
```


    Extracting text from 4 pdf pages...
    100%|██████████| 4/4 [00:00<00:00,  5.97it/s]



```python
doc[0:10]
```




    Test PDF 01
    
    This is a simple test pdf




```python
for token in doc[0:10]:
    print(f"Token: `{token}`, page number  {token._.page_number}")
```

    Token: `Test`, page number  1
    Token: `PDF`, page number  1
    Token: `01`, page number  1
    Token: `
    
    `, page number  1
    Token: `This`, page number  1
    Token: `is`, page number  1
    Token: `a`, page number  1
    Token: `simple`, page number  1
    Token: `test`, page number  1
    Token: `pdf`, page number  1



```python
doc[-10:]
```




    U3D or PRC and various other data formats.[15][16][17]
    





```python
for token in doc[-10:]:
    print(f"Token: `{token}`, page number  {token._.page_number}")
```

    Token: `U3D`, page number  4
    Token: `or`, page number  4
    Token: `PRC`, page number  4
    Token: `and`, page number  4
    Token: `various`, page number  4
    Token: `other`, page number  4
    Token: `data`, page number  4
    Token: `formats.[15][16][17`, page number  4
    Token: `]`, page number  4
    Token: `
    
    `, page number  4
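
Because every token carries its page number, the text can be regrouped page by page. The following is a minimal sketch, not part of spacypdfreader: `text_by_page` and `fake_token` are hypothetical helpers introduced here for illustration. It assumes only that each token exposes `.text` and `._.page_number`, as in the examples above, and it uses stand-in objects so it runs without spaCy or a PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(tokens):
    """Return {page_number: text} for objects exposing `.text` and `._.page_number`.

    Hypothetical helper, not part of spacypdfreader.
    """
    pages = defaultdict(list)
    for token in tokens:
        pages[token._.page_number].append(token.text)
    # Join with single spaces; the original whitespace is not preserved.
    return {page: " ".join(words) for page, words in pages.items()}


def fake_token(text, page_number):
    """Stand-in for a spaCy token (assumption: mirrors `.text` and `._.page_number`)."""
    return SimpleNamespace(text=text, _=SimpleNamespace(page_number=page_number))


sample = [fake_token("Test", 1), fake_token("PDF", 1), fake_token("U3D", 4)]
pages = text_by_page(sample)
print(pages)
```

With a real document, the same helper would presumably be called as `text_by_page(doc)`, where `doc` is the object returned by `pdf_reader`.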


