Website:
equityatlas.ai
Job details:
Role: Data Infrastructure & Web Scraping Engineer (Finance Focus)
Location: Remote / India
Type: Full-time / Contract (flexible)
MANDATORY - FILL IN THE GOOGLE FORM @ https://forms.gle/6PgnEsGtZ9mnx2xT9
🚀 What you’ll build
We are building a deep financial intelligence platform.
Your job is to design and own a production-grade web data ingestion system that can:
- Crawl company-level financial data from the web
- Extract structured + unstructured data from filings, PDFs, tables, and attachments
- Continuously monitor and update datasets (incremental syncs, not brute-force scraping)
This is not a basic scraping job — you will be building a reliable data backbone.
🧠 What we’re looking for
We want someone who thinks like a systems engineer, not just a scraper.
You should be comfortable with:
Core scraping & ingestion
- Reverse-engineering websites (XHR, hidden APIs, pagination patterns)
- Handling dynamic sites (JS-heavy, auth flows, cookies, headers)
- Robust scraping using:
  - httpx / requests
  - playwright / headless browsers (when needed)
- Designing modular crawlers (not one-off scripts)
Data extraction
- Parsing:
  - HTML tables (messy + inconsistent)
  - PDFs (text + tables)
  - Excel / CSV / XBRL
- Handling edge cases like:
  - broken structures
  - inconsistent schemas
  - scanned vs digital docs
System design (critical)
- Building idempotent pipelines
- Incremental ingestion (watermarks, checkpoints)
- Deduplication strategies (hashing, row fingerprinting)
- Retry logic, failure recovery, resumability
- Storage design (raw vs parsed layers)
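To give a sense of what we mean by idempotent, incremental ingestion, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our codebase: `row_fingerprint`, `ParsedStore`, and the `filed_on` field are hypothetical names, and an in-memory dict stands in for a real parsed-layer table.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Return a stable SHA-256 fingerprint for a parsed row."""
    # Strip string whitespace so cosmetic differences do not change the hash.
    normalized = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
    # Sorted keys and fixed separators give one canonical serialization per row.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ParsedStore:
    """Hypothetical in-memory stand-in for a parsed-layer table."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}  # fingerprint -> row
        self.watermark = ""              # newest ISO date ingested so far

    def ingest(self, batch: list[dict]) -> int:
        """Insert unseen rows and return how many were added."""
        inserted = 0
        newest = self.watermark
        for row in batch:
            # Watermark check: skip rows strictly older than the checkpoint.
            # ISO dates (YYYY-MM-DD) compare correctly as strings.
            if row["filed_on"] < self.watermark:
                continue
            fp = row_fingerprint(row)
            # Dedup check: re-running the same batch is a no-op.
            if fp in self.rows:
                continue
            self.rows[fp] = row
            inserted += 1
            newest = max(newest, row["filed_on"])
        self.watermark = newest
        return inserted


store = ParsedStore()
batch = [
    {"company": "ACME", "filed_on": "2024-05-01", "title": "Annual report "},
    {"company": "ACME", "filed_on": "2024-05-02", "title": "Board meeting"},
]
first_run = store.ingest(batch)
second_run = store.ingest(batch)  # same batch again: nothing new is written
```

One trade-off worth noticing: a strict watermark drops late-arriving filings with older dates, so a real system would likely pair it with periodic backfills.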
Automation & scaling
- Scheduling recurring crawls
- Designing continuous ingestion pipelines
- Handling rate limits, throttling, and long-running jobs
- Logging, monitoring, and alerting
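As an illustration of the kind of failure handling we expect around rate limits, here is a small retry helper with exponential backoff and jitter. It is a sketch, assuming a hypothetical `TransientError` that your fetch code raises for retryable failures (for example an HTTP 429 or a timeout); `with_retries` and its parameters are illustrative names.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout)."""


def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Backoff ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the ceiling so that
            # parallel workers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))


calls = {"n": 0}


def flaky_fetch():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


delays = []
result = with_retries(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.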
🏗️ What you’ll own
- Design a company-centric ingestion system
- Build modular pipelines for filings
- Implement:
  - attachment download + parsing pipelines
  - structured + unstructured storage
- Create a system that:
  - scales across thousands of companies
  - runs daily without breaking
  - survives site changes
🔥 Bonus points
- Experience with financial data
- Built large-scale crawlers (>1M documents)
- Experience with:
  - Full-stack web development
  - Airflow / Prefect / Dagster
  - Postgres / data warehousing
- Strong Python + clean architecture mindset
🧪 What success looks like
- We can ingest all filings for a company automatically
- New disclosures are picked up without manual intervention
- Data is:
  - clean
  - deduplicated
  - queryable
- Pipelines run reliably even if:
  - the website structure changes slightly
  - jobs fail midway
💡 Why this is interesting
- You’ll be building the core data layer of a financial intelligence system
- Real-world messy data problems (not toy datasets)
- Direct impact on:
  - ML models
  - trading strategies
  - research products
💰 Compensation
Flexible — depends on experience and impact.
(Open to high-quality contractors who can deliver fast.)
📩 How to apply
Send:
- Links to past scraping / data pipeline work (GitHub preferred)
- Brief explanation of:
  - the hardest scraping system you’ve built
  - how you handled failures / scaling
- Optional: how you would approach scraping something like exchange filings end-to-end
MANDATORY - FILL IN THE GOOGLE FORM @ https://forms.gle/6PgnEsGtZ9mnx2xT9
If you’ve only built simple scripts, this role is not for you.
If you’ve built systems that survive the real world, we want to talk.