Accolite
Website: bounteous.com
Job details:
Role Overview
We are seeking a Subject Matter Expert with deep, hands-on expertise across the end-to-end electronic communications processing pipeline. This role requires a specialist who understands every stage of transforming raw, multi-channel communication data into clean, normalised, ML-ready datasets — including data sanitisation, normalisation, de-duplication, and the handling of disclaimers, whitelists, and other noise-reduction techniques.
The SME will serve as the technical authority on data quality for the surveillance program, ensuring that the communications data feeding both deterministic rule engines and transformer-based ML models is accurate, complete, and free from noise that generates false positives. This role bridges the gap between raw channel ingestion (100+ sources) and the detection model pipeline.
Required Qualifications
- 10+ years in data engineering or data processing, with at least 5 years focused on eComms data in financial services
- Deep expertise in text processing and NLP preprocessing: tokenisation, normalisation, encoding handling, language detection, and noise reduction
- Proven experience building production-grade data pipelines for communication surveillance, e-discovery, or regulatory compliance
- Hands-on experience with at least two archiving or surveillance platforms (Global Relay, Smarsh, NICE Actimize, Behavox, Relativity)
- Strong understanding of electronic communication formats: EML, MSG, Bloomberg FLP, Teams/Slack JSON exports, voice transcription formats
- Experience with data de-duplication at scale: fuzzy matching algorithms (MinHash, SimHash, Jaccard), attachment hashing
- Understanding of ASIC INFO 283 data completeness requirements and multi-jurisdiction retention obligations
- Bachelor’s or Master’s degree in Computer Science, Data Science, Computational Linguistics, or Information Science
Preferred Qualifications
- Experience in an Australian Tier-1 bank environment with ASIC, AUSTRAC, and APRA oversight
- Knowledge of WORM-compliant archiving standards and chain-of-custody requirements
- Experience with multi-lingual text processing for APAC languages (Mandarin, Cantonese, Japanese, Malay)
- Familiarity with transformer model data preparation: sub-word tokenisation, attention masking, context windowing
- Experience with real-time streaming pipelines (Kafka Streams, Apache Flink)
Technical Skills & Tools
- Data processing: Apache Kafka, Kafka Streams, Apache Flink, Apache Spark, Azure Data Factory
- Text processing: spaCy, NLTK, regex, Beautiful Soup, Apache Tika, textract
- Languages: Python 3.10+, Java, SQL, Bash scripting
- De-duplication: MinHash, SimHash, Locality-Sensitive Hashing (LSH), ssdeep, TLSH
- Databases: PostgreSQL, Elasticsearch, MongoDB, Azure Cosmos DB
- Infrastructure: Docker, Kubernetes, Azure cloud services, Terraform
- Monitoring: Grafana, Prometheus, ELK Stack, Great Expectations (data quality)
- Communication platforms: Bloomberg Vault, Global Relay Archive, Smarsh Enterprise Archive, Microsoft Purview
Key Responsibilities
- Define and govern the end-to-end eComms data processing pipeline architecture: from raw ingestion through sanitisation, normalisation, de-duplication, and enrichment to ML-ready output
- Design data sanitisation processes: HTML/RTF stripping, embedded image noise removal, email header cleaning, and thread delineation for reply chains
- Build disclaimer detection and removal systems using pattern matching and ML classifiers, covering legal footers, confidentiality notices, and regulatory boilerplate
- Develop signature block detection and extraction using structural analysis
- Design whitelist management frameworks: approved counterparties, internal distribution lists, automated system message exclusions, with periodic review cycles and jurisdiction-specific separation
- Implement cross-channel message de-duplication: forwarded message detection, near-duplicate fuzzy matching (MinHash/SimHash), attachment fingerprinting, and conversation threading
- Build entity resolution pipelines: trader identity mapping (aliases, nicknames), counterparty normalisation, and channel-type classification
- Design metadata enrichment workflows: desk assignment, book mapping, counterparty risk tier, jurisdiction tagging, and timestamp UTC alignment
- Define and measure data quality KPIs: completeness rate, dedup accuracy, noise removal precision, signal loss rate (target < 2%), and pipeline throughput/latency SLAs
- Advise the ML team on data preparation requirements for transformer models: tokenisation strategies, sequence formatting, label engineering, and data augmentation
- Conduct regular pipeline quality audits and produce data quality scorecards for compliance review
- Document pipeline specifications, data dictionaries, and operational runbooks for regulatory examination readiness