Data Engineer (Web Scraper) - Intern

Min Experience

0 years

Location

Remote

Job Type

Internship

About the role

We're looking for a skilled Web Scraping Data Engineer (Intern) to design and implement robust data extraction systems. In this role, you'll develop scalable crawling architectures to collect high-quality data while ensuring compliance with ethical standards and data regulations. (Illustrative sketches of a crawler and a cleaning step follow this section.)

Key Responsibilities

Design and maintain efficient web crawling systems using frameworks like Scrapy, Playwright, or Selenium
Implement data processing pipelines to clean, normalize, and structure extracted content
Optimize crawling strategies to improve efficiency while respecting website policies
Develop monitoring systems to identify and resolve scraping issues quickly
Deliver high-quality datasets for analysis and model training
Implement storage solutions for large-scale data management
Ensure compliance with data regulations and ethical scraping practices

Required Skills

Strong Python programming experience; working knowledge of SQL
Hands-on experience with web scraping tools (BeautifulSoup, Scrapy, Selenium)
Proficiency with HTML, JavaScript, and HTTP protocols
Experience with data processing libraries (pandas, PySpark)
Familiarity with Linux/UNIX environments
Knowledge of version control systems and code review practices
Strong problem-solving abilities and attention to detail
Excellent communication skills (written and verbal English)

Good to Have (Optional)

Familiarity with AI frameworks (Hugging Face, LangChain, OpenAI)
Familiarity with LLM training pipelines and data requirements
Experience with text data augmentation and synthetic data generation

Preferred Qualifications

Experience with large-scale distributed crawling systems
Knowledge of proxy management and anti-bot evasion techniques
Familiarity with cloud platforms (AWS, GCP, Azure)
Experience with containerization (Docker, Kubernetes)
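
A minimal sketch of the kind of crawler described above, assuming the Scrapy framework named in the posting. The target site (quotes.toscrape.com), the CSS selectors, and the output file name are illustrative placeholders, not part of the role description.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        # Respect website policies: honor robots.txt and throttle requests.
        custom_settings = {
            "ROBOTSTXT_OBEY": True,
            "DOWNLOAD_DELAY": 1.0,
            "AUTOTHROTTLE_ENABLED": True,
        }

        def parse(self, response):
            # Extract and lightly clean each record before yielding it.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get("").strip(),
                    "author": quote.css("small.author::text").get("").strip(),
                }
            # Follow pagination links so the whole site is crawled.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    if __name__ == "__main__":
        # Write structured output as JSON Lines for downstream processing.
        process = CrawlerProcess(settings={
            "FEEDS": {"quotes.jl": {"format": "jsonlines"}},
        })
        process.crawl(QuotesSpider)
        process.start()

ROBOTSTXT_OBEY and AUTOTHROTTLE_ENABLED are standard Scrapy settings; enabling them is one concrete way to meet the "respecting website policies" responsibility.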
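
And a minimal sketch of the clean/normalize/structure step, using pandas (also named in the posting). It consumes the quotes.jl file produced by the hypothetical spider above; writing Parquet additionally assumes pyarrow or fastparquet is installed.

    import pandas as pd

    # Load the raw scraped records (JSON Lines, one object per line).
    df = pd.read_json("quotes.jl", lines=True)

    # Clean: drop exact duplicates and rows missing key fields.
    df = df.drop_duplicates().dropna(subset=["text", "author"])

    # Normalize: trim whitespace and standardize author casing.
    df["text"] = df["text"].str.strip()
    df["author"] = df["author"].str.strip().str.title()

    # Structure: persist a tidy dataset for analysis or model training.
    df.to_parquet("quotes_clean.parquet", index=False)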

Skills

Python
SQL
Scrapy
Selenium
HTML
JavaScript
HTTP
pandas
PySpark
Linux
Version control
Data processing