Job Title: Data Engineer / AI Data Pipeline Engineer

Benefits

  • 💸 Plenty o’ paid time off

    Team members start with 3 weeks of paid time off.

  • 📚 Education coverage

    Get up to $1,000 a year in skill development covered.

  • 🏝 Get paid to take a break

    Get a $1,000 bonus the first time you take a vacation that's 5 days or longer.

  • 🧘🏾 A big focus on health

    $200 monthly wellness stipend, to be used for whatever wellness means for you.

We are looking for a hands-on Data Engineer / AI Data Pipeline Engineer to join our growing engineering team. You'll work on cutting-edge AI-powered data enrichment, taxonomy validation, and scalable reporting frameworks across large-scale retail and enterprise datasets. The role sits at the intersection of data engineering and applied LLMs, and requires strong skills in Python, SQL, AWS cloud services, modern ETL architecture, and LLM-powered automation workflows.

Experience: 3–4 years

Location: Rajarhat-Newtown (Kolkata)

Employment Type: Full-time, Onsite

Timing: Must be able to work US Eastern time zone hours. Depending on project needs, this may be relaxed to a split schedule: half the day on IST hours and half on US Eastern hours.

Documents: Must have an Aadhaar card, verifiable education certificates, letters from past employers (if applicable), and criminal background clearance.

Key Responsibilities:

  • AI-Powered Taxonomy Audit & Enrichment:
    • Design and develop scalable, AI-driven taxonomy audit pipelines for retail store and brand data validation.
    • Build automated workflows leveraging LLMs (GPT-4o / OpenAI APIs) for classification, enrichment, and ontology standardization, using Instructor and Pydantic for reliable structured outputs.
    • Integrate web research and scraping systems (Serper API, ScrapingBee, html2text) to validate structured and unstructured data.
    • Develop human-in-the-loop review workflows using Label Studio for confirm/edit/reject audit processes.
    • Improve taxonomy coverage and entity-resolution accuracy through AI-assisted clustering and enrichment of unmapped transaction data.
  • Data Engineering & Pipeline Development:
    • Build and maintain modular, reusable ETL/data pipeline frameworks.
    • Refactor legacy reporting systems into modern, maintainable architectures with reusable SQL modules and query builders.
    • Develop validation frameworks, logging systems, automated migration workflows, and configurable comparison contexts.
    • Orchestrate workflows with Apache Airflow (DAGs, PythonOperator, XCom) and cloud-native AWS services.
    • Ensure backward compatibility and production stability during migration initiatives.
  • Reporting & Cloud Infrastructure:
    • Develop and optimize advanced SQL queries and reporting pipelines on Amazon Redshift / Redshift Serverless and PostgreSQL (RDS).
    • Manage data workflows using AWS services including S3, Lambda, Glue, CloudWatch, SSM Parameter Store, and Secrets Manager.
    • Monitor production pipelines, troubleshoot issues, and improve performance and reliability.
    • Collaborate with cross-functional teams across Data Engineering, AI/ML, QA, and Product.

Required Skills & Experience:

  • 3–4 years of experience in Python-based data engineering or backend engineering.
  • Strong proficiency in Python, including pandas, requests, psycopg2, and boto3, and solid experience in modular application development.
  • Hands-on experience with Apache Airflow (DAGs, PythonOperator, XCom).
  • Advanced SQL skills and a solid grasp of data warehousing concepts.
  • Experience with Amazon Redshift and PostgreSQL.
  • Sound understanding of ETL/data pipeline architecture and workflow orchestration.
  • Hands-on experience with AWS services: S3, Lambda, Glue, CloudWatch, SSM Parameter Store, and Secrets Manager.
  • Experience integrating LLM APIs (GPT-4o / OpenAI) into production workflows.
  • Familiarity with web scraping, search APIs, and data enrichment systems.
  • Experience with Git/GitHub, Jira, and Confluence.
  • Strong debugging, problem-solving, and analytical skills.

Good to Have:

  • Experience with Instructor, Pydantic, or AI workflow orchestration frameworks.
  • Exposure to Label Studio or other human-review annotation systems.
  • Experience with AI-assisted entity resolution and taxonomy/ontology systems.
  • Familiarity with scalable, modular ETL framework design.
  • Background in retail transaction data or taxonomy/master-data management.

Tech Stack:

  • Languages & Libraries: Python, advanced SQL, pandas, boto3, psycopg2, requests
  • Orchestration: Apache Airflow
  • AI / LLM: GPT-4o / OpenAI APIs, Instructor, Pydantic
  • Data & Warehousing: Amazon Redshift / Redshift Serverless, PostgreSQL (RDS)
  • AWS: S3, Lambda, Glue, CloudWatch, SSM Parameter Store, Secrets Manager
  • Scraping & Search: Serper API, ScrapingBee, html2text
  • Human Review: Label Studio
  • Collaboration: Git/GitHub, Jira, Confluence

Preferred Candidate Profile:

  • Self-driven, with end-to-end ownership of data workflows.
  • Comfortable in fast-paced AI/data engineering environments.
  • Strong communication and collaboration skills.
  • Passionate about building scalable, AI-assisted automation systems.
