AI Data Scraper

Minimal AI web scraper

Year
2025
Type of Project
Web scraping & data preprocessing utility
My Role
Frontend Engineer & AI Integrator

Case Study

Objective

Create a lightweight, browser-based tool to responsibly scrape and preprocess website text for AI training, supporting both raw text extraction and structured JSON output driven by custom prompts. Provide non-technical users with an easy way to batch-process multiple URLs and export AI-ready text or JSON data while enforcing basic ethical checks.

https://github.com/ujjwalredd?tab=repositories

Process

Implemented a React + TypeScript single-page application with Tailwind CSS for a clean, responsive UI that works on desktop and mobile.
Integrated the Google Gemini API as the processing layer that transforms scraped content into either cleaned text or structured JSON based on user-defined prompts.
Added dual modes (Text and JSON), batch URL input via textarea or .txt upload, concurrent processing for multiple URLs, and real-time status display per URL.
Built utilities to copy results to the clipboard or download them as .txt/.json files, and added an initial “respect-first” copyright check that blocks scraping when restrictions are detected.
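
The batch input and concurrent processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `parseUrlList`, `processBatch`, `UrlResult`, and the default limit of 3 are all assumptions, and `worker` stands in for whatever function scrapes a URL and sends it to the model.

```typescript
// Hypothetical names throughout; none of these come from the project's source.
type UrlStatus = "pending" | "processing" | "done" | "error";

interface UrlResult {
  url: string;
  status: UrlStatus;
  output?: string;
  error?: string;
}

// Split textarea or .txt content into unique, syntactically valid http(s) URLs.
function parseUrlList(input: string): string[] {
  const seen = new Set<string>();
  for (const line of input.split(/\r?\n/)) {
    const candidate = line.trim();
    if (candidate === "") continue;
    try {
      const parsed = new URL(candidate);
      if (parsed.protocol === "http:" || parsed.protocol === "https:") {
        seen.add(parsed.href);
      }
    } catch {
      // Skip lines that are not valid URLs.
    }
  }
  return Array.from(seen);
}

// Process URLs with at most `limit` in flight, reporting each status change
// through `onStatus` so a UI can show per-URL progress.
async function processBatch(
  urls: string[],
  worker: (url: string) => Promise<string>,
  onStatus: (result: UrlResult) => void,
  limit = 3,
): Promise<UrlResult[]> {
  const results: UrlResult[] = urls.map((url) => ({ url, status: "pending" }));
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function runLane(): Promise<void> {
    while (next < urls.length) {
      const index = next++;
      const url = urls[index];
      results[index] = { url, status: "processing" };
      onStatus(results[index]);
      try {
        const output = await worker(url);
        results[index] = { url, status: "done", output };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        results[index] = { url, status: "error", error: message };
      }
      onStatus(results[index]);
    }
  }

  const laneCount = Math.min(limit, urls.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A failed URL is recorded as an error rather than rejecting the whole batch, so one blocked or unreachable page does not discard the results for the others.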

Outcome

Delivered an interactive web app that can process multiple URLs at once and return AI-ready text or structured JSON suitable for building training corpora or small domain-specific datasets.
Improved data collection workflows by combining scraping, prompt-based structuring, and export options into a single, minimal interface requiring only a browser and an API key.
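
As an illustration of the export step, the sketch below assembles a downloadable file from a batch of results in either mode. It is an assumption-laden example rather than the shipped implementation: `buildExport`, `ExportFile`, the `scrape-<date>` filename pattern, and the `# <url>` separator in text mode are all hypothetical.

```typescript
// Hypothetical helper; names and file layout are illustrative assumptions.
type Mode = "text" | "json";

interface ExportFile {
  filename: string;
  mimeType: string;
  content: string;
}

function buildExport(
  mode: Mode,
  results: { url: string; output: string }[],
  timestamp: Date = new Date(),
): ExportFile {
  // YYYY-MM-DD portion of the ISO timestamp, used in the filename.
  const stamp = timestamp.toISOString().slice(0, 10);

  if (mode === "json") {
    const records = results.map(({ url, output }) => {
      let data: unknown;
      try {
        data = JSON.parse(output);
      } catch {
        // Keep the raw string if the model's output is not valid JSON.
        data = output;
      }
      return { url, data };
    });
    return {
      filename: `scrape-${stamp}.json`,
      mimeType: "application/json",
      content: JSON.stringify(records, null, 2),
    };
  }

  // Text mode: one block per URL, separated by a blank line.
  const content = results
    .map((r) => `# ${r.url}\n${r.output.trim()}`)
    .join("\n\n");
  return { filename: `scrape-${stamp}.txt`, mimeType: "text/plain", content };
}
```

In the browser, the returned `content` could be wrapped in a `Blob` and offered through a temporary object URL for download, or passed to `navigator.clipboard.writeText` for the copy action.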

Standout Features

Dual-Mode, Prompt-Driven Extraction
Batch Processing
Ethical Copyright Check
Concurrent Processing
Download & Copy
Clean & Minimalist UI
Responsive Design