Metadata-Version: 2.1
Name: genz-tokenize
Version: 1.1.3
Summary: Tokenize for subword
Home-page: https://github.com/nghiemIUH/genz-tokenize
Author: Van Nghiem
Author-email: vannghiem848@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE

# Genz Tokenize

[GitHub](https://github.com/nghiemIUH/genz-tokenize)

## Installation

    pip install genz-tokenize

## Basic tokenization

```python
>>> from genz_tokenize import Tokenize
>>> # Use the vocab bundled with the library
>>> tokenize = Tokenize()
>>> print(tokenize(['sinh_viên công_nghệ', 'hello'], maxlen=5))
[[1, 288, 433, 2, 0], [1, 20226, 2, 0, 0]]
>>> print(tokenize.decode([1, 288, 2]))
<s> sinh_viên </s>
>>> # Use a custom vocab
>>> tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')
```
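
The sample output above suggests a fixed layout: id 1 for `<s>`, id 2 for `</s>`, and 0 as padding up to `maxlen`. Below is a minimal plain-Python sketch of that layout, independent of the library. The function name `pad_ids` and the truncation behaviour are assumptions made for illustration, not the library's implementation.

```python
def pad_ids(token_ids, maxlen, bos_id=1, eos_id=2, pad_id=0):
    """Wrap token ids in <s>/</s> markers and right-pad with pad_id to maxlen."""
    if maxlen < 2:
        raise ValueError("maxlen must leave room for the <s> and </s> markers")
    # Assumption: inputs that are too long are cut so both markers still fit.
    body = list(token_ids)[: maxlen - 2]
    ids = [bos_id] + body + [eos_id]
    return ids + [pad_id] * (maxlen - len(ids))


print(pad_ids([288, 433], maxlen=5))  # [1, 288, 433, 2, 0]
print(pad_ids([20226], maxlen=5))     # [1, 20226, 2, 0, 0]
```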

## Tokenization for BERT models from the transformers library

```python
>>> from genz_tokenize import TokenizeForBert
>>> # Use the vocab bundled with the library
>>> tokenize = TokenizeForBert()
>>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True))
{'input_ids': [[1, 287, 432, 2, 0], [1, 20225, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}
>>> # Use a custom vocab
>>> tokenize = TokenizeForBert.fromFile('vocab.txt', 'bpe.codes')
```
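
In the output above, `attention_mask` is 1 at every real token position and 0 at every padding position. The sketch below reproduces that relationship in plain Python. The helper name `attention_mask_from_ids` is hypothetical, and treating id 0 as the only padding id is an assumption read off the sample output.

```python
def attention_mask_from_ids(batch_input_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in batch_input_ids]


input_ids = [[1, 287, 432, 2, 0], [1, 20225, 2, 0, 0]]
print(attention_mask_from_ids(input_ids))
# [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```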

## Embedding matrix

```python
>>> from genz_tokenize import get_embedding_matrix
>>> embedding_matrix = get_embedding_matrix()
```

### You can build your own vocab with the [subword-nmt (learn-joint-bpe-and-vocab)](https://github.com/rsennrich/subword-nmt) library
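
As a rough sketch of that step, the snippet below assembles a `subword-nmt learn-joint-bpe-and-vocab` command from Python. The helper names, the corpus file `corpus.txt`, and the merge count of 10000 are hypothetical placeholders. Whether the resulting `vocab.txt` and `bpe.codes` can be passed to `fromFile` without further processing is an assumption that has not been verified here.

```python
import shlex
import subprocess


def build_learn_bpe_command(corpus_path, codes_path, vocab_path, num_symbols=10000):
    """Return the argv list for subword-nmt's learn-joint-bpe-and-vocab command."""
    return [
        "subword-nmt", "learn-joint-bpe-and-vocab",
        "--input", corpus_path,            # training corpus, one sentence per line
        "-s", str(num_symbols),            # number of BPE merge operations (arbitrary here)
        "-o", codes_path,                  # learned BPE codes
        "--write-vocabulary", vocab_path,  # vocabulary with frequencies
    ]


def learn_bpe(corpus_path, codes_path="bpe.codes", vocab_path="vocab.txt", num_symbols=10000):
    """Run the command; requires `pip install subword-nmt` and an existing corpus file."""
    cmd = build_learn_bpe_command(corpus_path, codes_path, vocab_path, num_symbols)
    subprocess.run(cmd, check=True)


# Only print the command here; call learn_bpe("corpus.txt") to actually run it.
cmd = build_learn_bpe_command("corpus.txt", "bpe.codes", "vocab.txt")
print(shlex.join(cmd))
```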


