AI Data Scraper

Minimal AI web scraper

  • Year

    2025

  • Type of Project

    Web scraping & data preprocessing utility

  • My Role

    Frontend Engineer & AI Integrator

Case Study

Objective

Create a lightweight, browser-based tool to responsibly scrape and preprocess website text for AI training, supporting both raw text extraction and structured JSON output driven by custom prompts. Provide non-technical users with an easy way to batch-process multiple URLs and export AI-ready text or JSON data while enforcing basic ethical checks.

https://github.com/ujjwalredd?tab=repositories

Process

  • Implemented a React + TypeScript single-page application with Tailwind CSS for a clean, responsive UI that works on desktop and mobile.

  • Integrated the Google Gemini API to transform scraped content into either cleaned text or structured JSON based on user-defined prompts.

  • Added dual modes (Text and JSON), batch URL input via textarea or .txt upload, concurrent processing for multiple URLs, and real-time status display per URL.

  • Built utilities to copy results to clipboard or download them as .txt/.json files, and wired an initial copyright “respect-first” check that blocks scraping when restrictions are detected.
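
The batch input and concurrent processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `parseUrlList`, `processBatch`, `UrlStatus`, and `UrlResult` are hypothetical, and the `worker` callback stands in for whatever scrape-and-transform call the app makes for each URL.

```typescript
// Hypothetical types and helpers for illustration only.
type UrlStatus = "processing" | "done" | "error";

interface UrlResult {
  url: string;
  status: UrlStatus;
  output?: string;
  error?: string;
}

// Turn textarea or .txt contents into a de-duplicated list of http(s) URLs.
function parseUrlList(input: string): string[] {
  const seen = new Set<string>();
  for (const line of input.split(/\r?\n/)) {
    const candidate = line.trim();
    if (!candidate) continue;
    try {
      const parsed = new URL(candidate);
      if (parsed.protocol === "http:" || parsed.protocol === "https:") {
        seen.add(parsed.href);
      }
    } catch {
      // Not a parseable URL: skip the line.
    }
  }
  return Array.from(seen);
}

// Start every URL at once and report a status update as each one changes state.
// Each mapped promise catches its own error, so one failing URL never rejects the batch.
async function processBatch(
  urls: string[],
  worker: (url: string) => Promise<string>,
  onStatus: (result: UrlResult) => void = () => {},
): Promise<UrlResult[]> {
  return Promise.all(
    urls.map(async (url): Promise<UrlResult> => {
      onStatus({ url, status: "processing" });
      try {
        const output = await worker(url);
        const result: UrlResult = { url, status: "done", output };
        onStatus(result);
        return result;
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        const result: UrlResult = { url, status: "error", error: message };
        onStatus(result);
        return result;
      }
    }),
  );
}
```

In a React component, `onStatus` would typically update a per-URL state map so each row can show its own progress. This sketch starts all requests at once; a real app would probably cap the number in flight to stay within API rate limits.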

Outcome

  • Delivered an interactive web app that can process multiple URLs at once and return AI-ready text or structured JSON suitable for building training corpora or small domain-specific datasets.

  • Improved data collection workflows by combining scraping, prompt-based structuring, and export options into a single, minimal interface requiring only a browser and an API key.
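
As a rough illustration of how prompt-based structuring and export might fit together, the sketch below builds a mode-specific prompt, pulls a JSON value out of a model reply, and prepares a downloadable file. All names here (`Mode`, `buildPrompt`, `parseModelJson`, `buildExport`) and the prompt wording are assumptions made for this example, and the Gemini call itself is deliberately left out.

```typescript
// Hypothetical helpers for illustration; the prompt wording is an assumption.
type Mode = "text" | "json";

interface ExportFile {
  filename: string;
  mimeType: string;
  content: string;
}

// Combine a mode-specific instruction, the user's custom prompt, and the page text.
function buildPrompt(mode: Mode, pageText: string, userPrompt = ""): string {
  const instruction =
    mode === "json"
      ? `Extract structured data from the page text below and reply with JSON only. ${userPrompt}`.trim()
      : "Clean the page text below: remove navigation, ads, and boilerplate, and return plain text only.";
  return `${instruction}\n\n---\n${pageText}`;
}

// Models sometimes wrap JSON in extra prose, so keep only the outermost {...} or [...] span.
function parseModelJson(raw: string): unknown {
  const start = raw.search(/[\[{]/);
  const end = Math.max(raw.lastIndexOf("}"), raw.lastIndexOf("]"));
  if (start === -1 || end < start) {
    throw new Error("No JSON found in model response");
  }
  return JSON.parse(raw.slice(start, end + 1));
}

// Build the file offered for download: one JSON array, or plain text sections per URL.
function buildExport(mode: Mode, results: { url: string; output: string }[]): ExportFile {
  if (mode === "json") {
    const records = results.map((r) => ({ url: r.url, data: parseModelJson(r.output) }));
    return {
      filename: "results.json",
      mimeType: "application/json",
      content: JSON.stringify(records, null, 2),
    };
  }
  const content = results.map((r) => `# ${r.url}\n${r.output.trim()}`).join("\n\n");
  return { filename: "results.txt", mimeType: "text/plain", content };
}
```

In the browser, the returned `content` and `mimeType` could be wrapped in a `Blob` for download, or passed to `navigator.clipboard.writeText` for the copy action.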

Standout Features

  • Dual-mode, prompt-driven extraction

  • Batch Processing

  • Ethical Copyright Check

  • Concurrent Processing

  • Download & Copy

  • Clean & Minimalist UI

  • Responsive Design