WDFAP

WDFAP: Web Data Fetcher And Preparer

WDFAP is a Python tool that enables you to fetch, clean, prepare, label, and upload articles from web sources in various formats such as csv, json, xlsx, and parquet. It provides you with a versatile way to access, analyze, and manage diverse sets of data.

> [!NOTE]
> Currently, only the fetching feature is available.

Installation

To get started with WDFAP, follow these steps:

Clone this repository to your local machine:

git clone git@github.com:IsmaelMousa/WDFAP.git

Navigate to the WDFAP directory:

cd WDFAP

Set up a virtual environment:

python3 -m venv .venv

Activate the virtual environment:

source .venv/bin/activate

Set up WDFAP:

make setup

Usage

WDFAP provides an interactive terminal interface for fetching articles. For now, you can fetch articles from Wikipedia, Google News, or both simultaneously. The fetched data is stored in the `data/` directory in several formats for easy access and analysis.
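Fetching from both sources simultaneously can be pictured with a minimal `asyncio` sketch. This is an illustration under stated assumptions, not WDFAP's actual code: the names `fetch_wikipedia`, `fetch_google_news`, `FETCHERS`, and `fetch_articles` are hypothetical, and the fetchers return canned data instead of making network requests.

```python
import asyncio


async def fetch_wikipedia(topic: str) -> list[dict]:
    """Hypothetical stand-in for a Wikipedia fetcher; returns canned data."""
    await asyncio.sleep(0)  # placeholder for a real network call
    return [{"source": "wikipedia", "title": f"{topic} (encyclopedia entry)"}]


async def fetch_google_news(topic: str) -> list[dict]:
    """Hypothetical stand-in for a Google News fetcher; returns canned data."""
    await asyncio.sleep(0)  # placeholder for a real network call
    return [{"source": "google_news", "title": f"{topic} (news item)"}]


# Hypothetical registry mapping a user's choice to a fetcher coroutine.
FETCHERS = {"wikipedia": fetch_wikipedia, "google_news": fetch_google_news}


async def fetch_articles(topic: str, sources: list[str]) -> list[dict]:
    """Run the selected fetchers concurrently and flatten their results."""
    tasks = [FETCHERS[name](topic) for name in sources]
    # gather() returns results in the same order as the tasks were passed.
    batches = await asyncio.gather(*tasks)
    return [article for batch in batches for article in batch]


if __name__ == "__main__":
    articles = asyncio.run(fetch_articles("Python", ["wikipedia", "google_news"]))
    for article in articles:
        print(article["source"], "->", article["title"])
```

Keeping the fetchers in a dictionary means that choosing one source or both only changes the list passed to `fetch_articles`; the concurrency logic stays the same.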

Run the WDFAP:

make start

The terminal will then ask you a few questions. Here is an example session with results:

![Example](.github/images/demo.gif)

Modules

Here is a summary of the purpose of each major module or component in WDFAP:


| Module | Purpose |
|:------------------:|---------|
| `tools` | Provides utility functions and scripts for orchestrating the fetching, cleaning, labeling, and uploading of data from various sources. Currently includes an interactive script for fetching articles from web sources asynchronously. |
| `sources` | Provides modules for fetching articles asynchronously from different sources, such as Google News and Wikipedia. |
| `data` | Storage location where fetched articles are saved in formats such as `csv`, `json`, `xlsx`, and `parquet`. |
| `errors` | Defines custom exceptions for handling specific issues. |
| `utils` | Houses common utilities and logic used throughout the project. |
| `configs` | Contains the main configurations for both development and production stages. |
| `setup.py` | Configures the project metadata and dependencies for streamlined installation. |
| `main.py` | Serves as the entry point that starts the project. |
| `Makefile` | Provides commands for installing dependencies and running the application. |
| `requierments.txt` | Lists all the dependencies required to run the application. |
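To make the role of the `data` directory concrete, here is a minimal sketch of writing the same records in two of the listed formats. It uses only the standard library, so it covers `csv` and `json`; `xlsx` and `parquet` would need third-party packages such as those listed under Dependencies. The function name `save_articles` and the record fields are assumptions for illustration and are not taken from WDFAP's code.

```python
import csv
import json
from pathlib import Path


def save_articles(articles: list[dict], directory: Path, name: str) -> list[Path]:
    """Write the same records as CSV and JSON; return the paths written."""
    directory.mkdir(parents=True, exist_ok=True)
    # Union of keys in first-seen order, so every record fits the CSV header.
    fields = list(dict.fromkeys(key for article in articles for key in article))

    csv_path = directory / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(articles)  # missing keys are written as empty cells

    json_path = directory / f"{name}.json"
    json_path.write_text(
        json.dumps(articles, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return [csv_path, json_path]


if __name__ == "__main__":
    # Hypothetical record shape; the real fields may differ.
    sample = [{"title": "Example", "source": "wikipedia", "text": "Body text."}]
    for path in save_articles(sample, Path("data"), "articles"):
        print("wrote", path)
```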

Dependencies

Here is an overview of the packages WDFAP depends on and what each is used for:


| Dependency | Usage |
|:----------------:|-------|
| `beautifulsoup4` | Offers powerful tools for parsing and navigating HTML documents, simplifying the extraction of structured data from web pages. |
| `newspaper` | Simplifies the extraction and curation of articles from online sources, streamlining the process of gathering news content. |
| `feedparser` | Parses RSS and Atom feeds, enabling extraction of syndicated content from websites and blogs. |
| `asyncio` | Facilitates asynchronous I/O operations, allowing for concurrent execution of tasks without blocking the event loop. |
| `aiohttp` | Provides asynchronous HTTP client/server functionality for asyncio, enabling efficient handling of web requests and responses. |
| `pandas` | Provides high-performance data manipulation and analysis tools, ideal for working with structured datasets. |
| `tqdm` | Enhances loops with progress bars, providing visual feedback on the progress of iterative tasks. |
| `openpyxl` | Facilitates reading from and writing to Excel files, enabling manipulation of spreadsheet data with Python. |
| `pyarrow` | Provides tools for working with Apache Arrow data, an in-memory columnar data format, offering efficient data interchange between different systems. |
| `fastparquet` | Offers efficient reading and writing of Parquet files, a columnar storage format optimized for analytics workloads. |

Sources

For now, the available web sources are Wikipedia and Google News.

Contributing

We appreciate your interest in contributing to our project! Your contributions help us improve and grow.

Please see the Contributing document for the contribution guidelines, and make sure to read the CODE_OF_CONDUCT document.
