Web Scraping Engineer

Equity Atlas

Location: New Delhi, Delhi, India
Job type: Full-time

Required skills

Python
Airflow
Backbone
cookies
CSV
data ingestion
data pipeline
end-to-end
full stack
GitHub
HTML
Postgres
web development

About the role

Website: equityatlas.ai
Job details:

Role: Data Infrastructure & Web Scraping Engineer (Finance Focus)

Location: Remote / India

Type: Full-time / Contract (flexible)

MANDATORY - FILL IN THE GOOGLE FORM @ https://forms.gle/6PgnEsGtZ9mnx2xT9

🚀 What you’ll build

We are building a deep financial intelligence platform.

Your job is to design and own a production-grade web data ingestion system that can:

Crawl company-level financial data from the web
Extract structured + unstructured data from filings, PDFs, tables, and attachments
Continuously monitor and update datasets (incremental syncs, not brute-force scraping)

This is not a basic scraping job — you will be building a reliable data backbone.

🧠 What we’re looking for

We want someone who thinks like a systems engineer, not just a scraper.

You should be comfortable with:

Core scraping & ingestion

Reverse-engineering websites (XHR, hidden APIs, pagination patterns)
Handling dynamic sites (JS-heavy, auth flows, cookies, headers)
Robust scraping using:
  httpx / requests
  playwright / headless browsers (when needed)
Designing modular crawlers (not one-off scripts)

Data extraction

Parsing:
  HTML tables (messy + inconsistent)
  PDFs (text + tables)
  Excel / CSV / XBRL
Handling edge cases like:
  broken structures
  inconsistent schemas
  scanned vs digital docs

System design (critical)

Building idempotent pipelines
Incremental ingestion (watermarks, checkpoints)
Deduplication strategies (hashing, row fingerprinting)
Retry logic, failure recovery, resumability
Storage design (raw vs parsed layers)
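
To make the terms above concrete, here is a minimal sketch of row fingerprinting combined with a watermark, assuming each parsed filing is a dict with an ISO-formatted `filed_at` field. The field names and the `ingest` helper are hypothetical illustrations, not part of our codebase.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Return a stable SHA-256 hash of a row, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(rows, seen: set, watermark: str):
    """Return (new_rows, new_watermark). Re-running on the same input is a no-op.

    Assumes `filed_at` is an ISO date string, so string comparison matches
    chronological order (hypothetical schema).
    """
    fresh = []
    for row in rows:
        # Skip rows strictly older than the watermark; rows on the boundary
        # are re-checked by fingerprint so late arrivals are not lost.
        if row["filed_at"] < watermark:
            continue
        fp = row_fingerprint(row)
        if fp in seen:
            continue
        seen.add(fp)
        fresh.append(row)
    new_watermark = max([watermark] + [r["filed_at"] for r in fresh])
    return fresh, new_watermark
```

In a real pipeline the `seen` set and the watermark would live in durable storage (for example a Postgres table) rather than in memory.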

Automation & scaling

Scheduling recurring crawls
Designing continuous ingestion pipelines
Handling rate limits, throttling, and long-running jobs
Logging, monitoring, and alerting
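
As a rough illustration of the retry and throttling handling listed above, the sketch below retries a failing call with exponential backoff and jitter. The `retry` helper and its parameters are hypothetical; a production version would also catch narrower exception types and log each attempt.

```python
import random
import time


def retry(fn, attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, backing off between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay, plus random jitter
            # so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` argument is injectable so the backoff schedule can be tested without real waiting.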

🏗️ What you’ll own

Design a company-centric ingestion system
Build modular pipelines for filings
Implement:
  attachment download + parsing pipelines
  structured + unstructured storage
Create a system that:
  scales across thousands of companies
  runs daily without breaking
  survives site changes

🔥 Bonus points

Experience with financial data
Built large-scale crawlers (>1M documents)
Experience with:
  Full-stack web development
  Airflow / Prefect / Dagster
  Postgres / data warehousing
Strong Python + clean architecture mindset

🧪 What success looks like

We can ingest all filings for a company automatically
New disclosures are picked up without manual intervention
Data is:
  clean
  deduplicated
  queryable
Pipelines run reliably even if:
  the website structure changes slightly
  jobs fail midway

💡 Why this is interesting

You’ll be building the core data layer of a financial intelligence system
Real-world messy data problems (not toy datasets)
Direct impact on:
  ML models
  trading strategies
  research products

💰 Compensation

Flexible — depends on experience and impact.

(Open to high-quality contractors who can deliver fast.)

📩 How to apply

Send:

Links to past scraping / data pipeline work (GitHub preferred)
Brief explanation of:
  the hardest scraping system you’ve built
  how you handled failures / scaling

Optional: how you would approach scraping something like exchange filings end-to-end

MANDATORY - FILL IN THE GOOGLE FORM @ https://forms.gle/6PgnEsGtZ9mnx2xT9

If you’ve only built simple scripts, this role is not for you.

If you’ve built systems that survive the real world, we want to talk.

