Subject Matter Expert — eComms Processing Pipeline

Accolite

Location: Gurgaon, Haryana, India
Job type: Full-time

Required skills

Python
Apache
Apache Flink
Apache Kafka
Apache Spark
Azure
Bash
compliance
data science
Docker
Elasticsearch
end-to-end
Flink
HTML
Java
JSON
Kafka
Kubernetes
NLP
PostgreSQL
Slack
Spark
SQL
Terraform
Vault

About the role

Website: bounteous.com
Job details:
Role Overview

We are seeking a Subject Matter Expert with deep, hands-on expertise across the end-to-end electronic communications processing pipeline. This role requires a specialist who understands every stage of transforming raw, multi-channel communication data into clean, normalised, ML-ready datasets — including data sanitisation, normalisation, de-duplication, and the handling of disclaimers, whitelists, and other noise-reduction techniques.

The SME will serve as the technical authority on data quality for the surveillance program, ensuring that the communications data feeding both deterministic rule engines and transformer-based ML models is accurate, complete, and free from noise that generates false positives. This role bridges the gap between raw channel ingestion (100+ sources) and the detection model pipeline.

Required Qualifications

10+ years in data engineering or data processing, with at least 5 years focused on eComms data in financial services
Deep expertise in text processing and NLP preprocessing: tokenisation, normalisation, encoding handling, language detection, and noise reduction
Proven experience building production-grade data pipelines for communication surveillance, e-discovery, or regulatory compliance
Hands-on experience with at least two archiving or surveillance platforms (Global Relay, Smarsh, NICE Actimize, Behavox, Relativity)
Strong understanding of electronic communication formats: EML, MSG, Bloomberg FLP, Teams/Slack JSON exports, voice transcription formats
Experience with data de-duplication at scale: fuzzy matching techniques (MinHash, SimHash, Jaccard similarity) and attachment hashing
Understanding of ASIC INFO 283 data completeness requirements and multi-jurisdiction retention obligations
Bachelor’s or Master’s degree in Computer Science, Data Science, Computational Linguistics, or Information Science

Preferred Qualifications

Experience in an Australian Tier-1 bank environment with ASIC, AUSTRAC, and APRA oversight
Knowledge of WORM-compliant archiving standards and chain-of-custody requirements
Experience with multi-lingual text processing for APAC languages (Mandarin, Cantonese, Japanese, Malay)
Familiarity with transformer model data preparation: sub-word tokenisation, attention masking, context windowing
Experience with real-time streaming pipelines (Kafka Streams, Apache Flink)

Technical Skills & Tools

Data processing: Apache Kafka, Kafka Streams, Apache Flink, Apache Spark, Azure Data Factory
Text processing: spaCy, NLTK, regex, Beautiful Soup, Apache Tika, textract
Languages: Python 3.10+, Java, SQL, Bash scripting
De-duplication: MinHash, SimHash, Locality-Sensitive Hashing (LSH), ssdeep, TLSH
Databases: PostgreSQL, Elasticsearch, MongoDB, Azure Cosmos DB
Infrastructure: Docker, Kubernetes, Azure cloud services, Terraform
Monitoring: Grafana, Prometheus, ELK Stack, Great Expectations (data quality)
Communication platforms: Bloomberg Vault, Global Relay Archive, Smarsh Enterprise Archive, Microsoft Purview

Key Responsibilities

Define and govern the end-to-end eComms data processing pipeline architecture: from raw ingestion through sanitisation, normalisation, de-duplication, and enrichment to ML-ready output

Design data sanitisation processes: HTML/RTF stripping, embedded image noise removal, email header cleaning, and thread delineation for reply chains

Build disclaimer detection and removal systems using pattern matching and ML classifiers — covering legal footers, confidentiality notices, and regulatory boilerplate

Develop signature block detection and extraction using structural analysis

Design whitelist management frameworks: approved counterparties, internal distribution lists, automated system message exclusions, with periodic review cycles and jurisdiction-specific separation

Implement cross-channel message de-duplication: forwarded message detection, near-duplicate fuzzy matching (MinHash/SimHash), attachment fingerprinting, and conversation threading

Build entity resolution pipelines: trader identity mapping (aliases, nicknames), counterparty normalisation, and channel-type classification

Design metadata enrichment workflows: desk assignment, book mapping, counterparty risk tier, jurisdiction tagging, and timestamp alignment to UTC

Define and measure data quality KPIs: completeness rate, dedup accuracy, noise removal precision, signal loss rate (target < 2%), and pipeline throughput/latency SLAs

Advise the ML team on data preparation requirements for transformer models: tokenisation strategies, sequence formatting, label engineering, and data augmentation

Conduct regular pipeline quality audits and produce data quality scorecards for compliance review

Document pipeline specifications, data dictionaries, and operational runbooks for regulatory examination readiness
