Job Title: Data Engineer / AI Data Pipeline Engineer

We are looking for a hands-on Data Engineer / AI Data Pipeline Engineer to join our growing engineering team. You'll work on cutting-edge AI-powered data enrichment, taxonomy validation, and scalable reporting frameworks across large-scale retail and enterprise datasets. The role sits at the intersection of data engineering and applied LLMs, and requires strong skills in Python, SQL, AWS cloud services, modern ETL architecture, and LLM-powered automation workflows.

Benefits

💸
Plenty o’ paid time off
Team members start with 3 weeks of paid time off.
📚
Education coverage
Get up to $1,000 a year in skill development covered.
🏝
Get paid to take a break
Get a $1,000 bonus the first time you take a vacation that's 5 days or longer.
🧘🏾
A big focus on health
$200 monthly wellness stipend, to be used for whatever wellness means for you.

Kolkata, Rajarhat

Full Time

IT Technology

Experience: 3–4 years

Location: Rajarhat-Newtown (Kolkata)

Employment Type: Full-time, Onsite

Timing: Must be able to work US Eastern time zone hours. Depending on project needs, this may be relaxed to a split schedule: half the day in IST and half in US Eastern time.

Documents: Must have an Aadhaar card, verifiable education certificates, letters from past employers (if applicable), and a criminal background clearance.

Key Skills Required:

AI-Powered Taxonomy Audit & Enrichment:
- Design and develop scalable, AI-driven taxonomy audit pipelines for retail store and brand data validation.
- Build automated workflows leveraging LLMs (GPT-4o / OpenAI APIs) for classification, enrichment, and ontology standardization, using Instructor and Pydantic for reliable structured outputs.
- Integrate web research and scraping systems (Serper API, ScrapingBee, html2text) to validate structured and unstructured data.
- Develop human-in-the-loop review workflows using Label Studio for confirm/edit/reject audit processes.
- Improve taxonomy coverage and entity-resolution accuracy through AI-assisted clustering and enrichment of unmapped transaction data.

Data Engineering & Pipeline Development:
- Build and maintain modular, reusable ETL/data pipeline frameworks.
- Refactor legacy reporting systems into modern, maintainable architectures with reusable SQL modules and query builders.
- Develop validation frameworks, logging systems, automated migration workflows, and configurable comparison contexts.
- Orchestrate workflows with Apache Airflow (DAGs, PythonOperator, XCom) and cloud-native AWS services.
- Ensure backward compatibility and production stability during migration initiatives.

Reporting & Cloud Infrastructure:
- Develop and optimize advanced SQL queries and reporting pipelines on Amazon Redshift / Redshift Serverless and PostgreSQL (RDS).
- Manage data workflows using AWS services including S3, Lambda, Glue, CloudWatch, SSM Parameter Store, and Secrets Manager.
- Monitor production pipelines, troubleshoot issues, and improve performance and reliability.
- Collaborate with cross-functional teams across Data Engineering, AI/ML, QA, and Product.

Required Skills & Experience:

3–4 years of experience in Python-based data engineering or backend engineering.
Strong proficiency in Python, including pandas, requests, psycopg2, and boto3, with solid experience building modular applications.
Hands-on experience with Apache Airflow (DAGs, PythonOperator, XCom).
Advanced SQL skills and a solid grasp of data warehousing concepts.
Experience with Amazon Redshift and PostgreSQL.
Sound understanding of ETL/data pipeline architecture and workflow orchestration.
Hands-on experience with AWS services: S3, Lambda, Glue, CloudWatch, SSM Parameter Store, and Secrets Manager.
Experience integrating LLM APIs (GPT-4o / OpenAI) into production workflows.
Familiarity with web scraping, search APIs, and data enrichment systems.
Experience with Git/GitHub, Jira, and Confluence.
Strong debugging, problem-solving, and analytical skills.

Good to Have:

Experience with Instructor, Pydantic, or AI workflow orchestration frameworks.
Exposure to Label Studio or other human-review annotation systems.
Experience with AI-assisted entity resolution and taxonomy/ontology systems.
Familiarity with scalable, modular ETL framework design.
Background in retail transaction data or taxonomy/master-data management.

Tech Stack:

Languages & Libraries: Python, advanced SQL, pandas, boto3, psycopg2, requests
Orchestration: Apache Airflow
AI / LLM: GPT-4o / OpenAI APIs, Instructor, Pydantic
Data & Warehousing: Amazon Redshift / Redshift Serverless, PostgreSQL (RDS)
AWS: S3, Lambda, Glue, CloudWatch, SSM Parameter Store, Secrets Manager
Scraping & Search: Serper API, ScrapingBee, html2text
Human Review: Label Studio
Collaboration: Git/GitHub, Jira, Confluence

Preferred Candidate Profile:

Self-driven, with end-to-end ownership of data workflows.
Comfortable in fast-paced AI/data engineering environments.
Strong communication and collaboration skills.
Passionate about building scalable, AI-assisted automation systems.