Metadata-Version: 2.1
Name: hnscraper
Version: 0.1.0
Summary: Hackernews Scraper
Home-page: https://github.com/pushp1997/Hackernews-Scraping/
License: MIT
Author: Pushp Vashisht
Author-email: pushptyagi1@gmail.com
Requires-Python: >=3.7,<4
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: BeautifulSoup4 (>=4.11.1,<5.0.0)
Requires-Dist: jupyterlab (>=3.4.3,<4.0.0)
Requires-Dist: notebook (>=6.4.12,<7.0.0)
Requires-Dist: pathlib (>=1.0.1,<2.0.0)
Requires-Dist: pymongo (>=4.1.1,<5.0.0)
Requires-Dist: requests (>=2.28.0,<3.0.0)
Project-URL: Repository, https://github.com/pushp1997/Hackernews-Scraping/
Description-Content-Type: text/markdown

# Hackernews-Scraping
![Tests](https://github.com/pushp1997/Hackernews-Scraping/actions/workflows/tests.yml/badge.svg)
### Business Requirements:

1. Scrape TheHackernews.com and store the result (Description, Image, Title, Url) in mongo db
2. Maintain two relations - 1 with the url and title of the blog and other one with url and its meta data like (Description, Image, Title, Author)

### Requirements:

-   python3
-   pip
-   python libraries:
    _ requests
    _ BeautifulSoup4
    _ pymongo
    _ jupyterlab \* notebook
-   MongoDB
-   git

## To run the application on your local machine:

### Clone the repository:

1. Type the following in your terminal

    `git clone https://github.com/pushp1997/Hackernews-Scraping.git`

2. Change the directory into the repository

    `cd ./Hackernews-Scraping`

3. Create python virtual environment

    `python3 -m venv ./scrapeVenv`

4. Activate the virtual environment created

    - On linux / MacOS :
      `source ./scrapeVenv/bin/activate`
    - On Windows (cmd) :
      `"./scrapeVenv/Scripts/activate.bat"`
    - On Windows (powershell) :
      `"./scrapeVenv/Scripts/activate.ps1"`

5. Install python requirements

    `pip install -r requirements.txt`

6. Open the ipynb using jupyter notebook

    `jupyter notebook "Hackernews Scraper.ipynb"`

7. Run the notebook, you will be asked to provide inputs for no of pages to scrape to get the post and your MongoDB database URI to store the posts data.

8. Open mongodb shell connecting to the same URI you provided to the ipynb notebook while running it.

9. Change the database

    `use hackernews`

10. Print the documents in the 'url-title' collection

    `db["url-title"].find().pretty()`

11. Print the documents in the 'url-others' collection

    `db["url-others"].find().pretty()`

