Indian Food Ingredients & Label Variants

A mapping of 2,500+ regional ingredient variations observed on Indian food labels, linking label variants to a canonical vocabulary. Note: this dataset has been superseded — the v1 approach was abandoned after finding it conflated noise reduction with meaningful cultural and linguistic variation.

Author

Lalitha A R

Affiliation

iSRL

Published

February 1, 2026

DOI

10.5281/zenodo.1871452

Abstract

This dataset has been superseded. The v1 mapping approach — standardising ingredient label variants to a canonical vocabulary — was found to conflate noise reduction with meaningful cultural and linguistic variation. This document explains why the approach was abandoned and what replaced it.

The dataset maps 2,500+ regional ingredient variations observed on Indian food labels. It captures the diverse ways ingredients are named on Indian food packaging by linking variants (the actual text found on labels) to a canon (a standardised, clean category).

Example mapping: Canon: Acetic Acid (INS 260) — Variants: acidity regulator 260, vinegar, ins 260, acetic acid (260).

Contributors

Subrat Sethi

Purnendu Shukla

Important: This version has been superseded

This dataset is no longer maintained. The v1 approach was found to be structurally inadequate for the problem it was designed to solve. The full reasoning is documented below. The dataset remains available for reference at the link above.

For current work, see the Identity, Transformation, and Function framework and its justification companion.

We released Indian Food Ingredients & Label Variants (v1) with the goal of making ingredient label text parseable by machines. The dataset standardised ingredient names — mapping kashmiri chilli to chilli, for instance — on the assumption that a normalised vocabulary would make automated parsing tractable.

Two problems emerged as data collection continued.

First, the approach trades away information the project is now explicitly committed to preserving. The data makes this concrete.

```text
                            canon               variant
0                      A2 Protein            a2 protein
1  Acesulfame Potassium (INS 950)          acesulfame k
2  Acesulfame Potassium (INS 950)  acesulfame potassium
3  Acesulfame Potassium (INS 950)     sweetener ins 950
4           Acetic Acid (INS 260)           acetic acid
```

```python
import pandas as pd

df = pd.read_csv("data/ingredients.csv", header=None, names=["canon", "variant"])

# All variants that map to Chilli in v1
chilli = df[df["canon"] == "Chilli"].copy()

# The ones that carry regional and variety-level identity
regional = chilli[chilli["variant"].str.contains(
    "kashmiri|mathania|jalapeño|lal mirch", case=False
)].reset_index(drop=True)

print(regional.to_string(index=False))
```

```text
 canon                                                          variant
Chilli                                                  kashmiri chilli
Chilli                                               kashmiri lal mirch
Chilli                                                    mild jalapeño
Chilli salt with spices and condiments chillies and capsicum lal mirchi
Chilli                 spices and condiments kashmiri red chilli powder
Chilli                 spices and condiments mathania red chilli powder
Chilli                                      stalkless kashmiri chillies
```

In v1, every row above maps to Chilli. Kashmiri lal mirch, mathania red chilli powder, stalkless kashmiri chillies — all collapsed into the same canon as chili powder and red chilly flakes.

The brands that wrote these labels did not have to be this specific. Kashmiri chilli could have been declared simply as chilli, and the label would still have been legally compliant. The choice to name it specifically was a choice to preserve something: a regional identity, a flavour profile, a cultural referent that Indian consumers recognise and reach for. The v1 mapping erases that choice.
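The erasure can be made measurable. The sketch below is illustrative only: it rebuilds a few of the rows shown above inline, and `QUALIFIERS` and `qualifier_of` are hypothetical names introduced here, not part of the v1 dataset. It counts how many distinct regional or variety qualifiers appear in the label text and how many survive in the canon.

```python
from typing import Optional

import pandas as pd

# A few (canon, variant) rows copied from the v1 output shown above.
rows = [
    ("Chilli", "kashmiri chilli"),
    ("Chilli", "kashmiri lal mirch"),
    ("Chilli", "mild jalapeño"),
    ("Chilli", "spices and condiments mathania red chilli powder"),
    ("Chilli", "stalkless kashmiri chillies"),
]
df = pd.DataFrame(rows, columns=["canon", "variant"])

# Hypothetical qualifier list, for illustration only.
QUALIFIERS = ["kashmiri", "mathania", "jalapeño"]


def qualifier_of(text: str) -> Optional[str]:
    """Return the first known qualifier found in the text, if any."""
    lowered = text.lower()
    return next((q for q in QUALIFIERS if q in lowered), None)


df["variant_qualifier"] = df["variant"].map(qualifier_of)
df["canon_qualifier"] = df["canon"].map(qualifier_of)

# nunique() ignores missing values, so it counts surviving qualifiers.
before = df["variant_qualifier"].nunique()
after = df["canon_qualifier"].nunique()
print(f"distinct qualifiers on labels: {before}")
print(f"distinct qualifiers after mapping: {after}")
```

Because the mapping is many-to-one, nothing in the canon column allows the original qualifier to be recovered once the variant text is discarded.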

This is not only a question of cultural fidelity. Ingredient identity has legal and fiscal consequences. Fresh Alphonso mangoes attract 0% GST as agricultural produce; mango pulp processed from a specific GI-tagged variety enters a different regulatory category. Kashmiri chilli carries a Geographical Indication; a generic chilli does not. When a mapping table collapses these into one canon, it does not simplify the data; it destroys the signal that downstream regulatory, taxation, and traceability systems depend on. Respecting the taste of India is not a sentiment; it is a data integrity requirement.

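Second, the ingredient name space in Indian packaged food is too diverse for automated mapping to be reliable. The problem splits into two structurally different cases: noise, such as typos and inconsistent spelling, which can safely be normalised away; and meaningful variation, such as regional, varietal, and linguistic names, which must be preserved.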

Maintaining a single mapping table that handles both cases conflates the problem. In practice, it means tracking every normalisation decision made during data cleaning — effectively a log of every typo fixed across thousands of rows — with no mechanism to distinguish meaningful variation from noise.
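One way to keep the two cases apart is to confine noise handling to a step that cannot touch identity. The following is a minimal sketch under stated assumptions: `SPELLING_FIXES` and `normalise_noise` are hypothetical names, and the two spelling fixes are taken from the misspellings quoted earlier (chili, chilly). It is not the project's actual cleaning code.

```python
import re

# Hypothetical typo table standing in for the log of spelling fixes.
SPELLING_FIXES = {"chilly": "chilli", "chili": "chilli"}


def normalise_noise(label_text: str) -> str:
    """Fix case, punctuation, whitespace and known typos; nothing else.

    Regional or variety words such as "kashmiri" pass through untouched,
    because they are not in the typo table.
    """
    text = label_text.casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                 # splitting also collapses whitespace
    return " ".join(SPELLING_FIXES.get(token, token) for token in tokens)


print(normalise_noise("Red  Chilly Flakes"))
print(normalise_noise("KASHMIRI Lal Mirch."))
```

The design choice is that this function only ever rewrites spelling and layout. Any decision about what an ingredient is happens elsewhere, so a typo fix can never silently become an identity change.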

The ingredient substrate under development makes this mapping unnecessary. A deterministic identity layer — one that assigns canonical identifiers to ingredients independent of how they are written on any given label — eliminates the need for probabilistic name matching at parse time. Labels are parsed against the substrate, not against a maintained vocabulary of variants.
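The substrate itself is not specified in this note, so the sketch below is only one possible reading of "deterministic identity". It assumes an ingredient is described by a structured record (the hypothetical `Identity` class) and derives a stable identifier from that record alone, using a name-based UUID. The namespace URL is a placeholder.

```python
import uuid
from dataclasses import dataclass

# Placeholder namespace; a real deployment would pick its own.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ingredients")


@dataclass(frozen=True)
class Identity:
    """Hypothetical identity record: what the ingredient is, not how it is spelt."""
    source: str
    variety: str = ""


def identifier(identity: Identity) -> uuid.UUID:
    """Derive the identifier from the record only, never from label text."""
    return uuid.uuid5(NAMESPACE, f"{identity.source}/{identity.variety}")


kashmiri_id = identifier(Identity("chilli", "kashmiri"))
same_again = identifier(Identity("chilli", "kashmiri"))
generic_id = identifier(Identity("chilli"))

print(kashmiri_id == same_again)  # same record, same identifier
print(kashmiri_id == generic_id)  # the variety keeps its own identifier
```

Under this reading, "kashmiri chilli" and "kashmiri lal mirch" would share an identifier because they describe the same record, while generic chilli would keep a separate one.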

The v1 dataset will remain available for reference. The label variants mapping will not be maintained going forward.


This brings us to the question of how we extract the variants in a way that preserves the signal.

How do we formalise that milk solids feels like it should be under milk while butter feels different? How do we measure the distance between a variant and its source ingredient?

These questions led to a food classification framework inspired by Ranganathan’s 1933 Colon Classification [1][2] and grounded in Indian judicial and regulatory precedents (FSSAI, ITC-HS, court rulings).

[1] Colon Classification (Faceted Classification) by S. R. Ranganathan, the father of Indian library science.

[2] Instead of a flat list, faceted classification lets us express a single object as a set of values across independent dimensions, the way filtering by price, type, and brand on Amazon works, rather than browsing a single ranked list.
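To make the faceted idea concrete, here is a small sketch. The three facet names follow the Identity, Transformation, and Function framework mentioned above, but the facet values, the `FacetedIngredient` class, and the distance measure are all illustrative assumptions rather than the framework's definitions.

```python
from dataclasses import dataclass

FACETS = ("identity", "transformation", "function")


@dataclass(frozen=True)
class FacetedIngredient:
    name: str
    identity: str        # what it comes from
    transformation: str  # what was done to it
    function: str        # what it does in the product


# Facet values are illustrative guesses, not taken from the framework.
CATALOGUE = [
    FacetedIngredient("milk", "milk", "none", "base"),
    FacetedIngredient("milk solids", "milk", "dried", "base"),
    FacetedIngredient("butter", "milk", "churned", "fat"),
    FacetedIngredient("kashmiri chilli", "chilli", "none", "spice"),
]
milk, milk_solids, butter, kashmiri_chilli = CATALOGUE


def select(items, **facets):
    """Filter by any combination of facets, like filters on a shopping site."""
    return [
        item for item in items
        if all(getattr(item, key) == value for key, value in facets.items())
    ]


def facet_distance(a: FacetedIngredient, b: FacetedIngredient) -> int:
    """Naive distance: the number of facets on which two ingredients differ."""
    return sum(getattr(a, facet) != getattr(b, facet) for facet in FACETS)


print([item.name for item in select(CATALOGUE, identity="milk")])
print(facet_distance(milk, milk_solids))
print(facet_distance(milk, butter))
```

With these assumed values, milk solids differs from milk on one facet and butter on two, which is one simple way to express the intuition that milk solids sits closer to milk than butter does.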

Citation

BibTeX citation:
@dataset{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Indian {Food} {Ingredients} \& {Label} {Variants}},
  number = {iSRL-26-02-DS-Variants},
  date = {2026-02-01},
  url = {https://isrl-research.github.io/pub/2026-02-ds-variants/},
  doi = {10.5281/zenodo.1871452},
  langid = {en},
  abstract = {**This dataset has been superseded.** The v1 mapping
    approach — standardising ingredient label variants to a canonical
    vocabulary — was found to conflate noise reduction with meaningful
    cultural and linguistic variation. This document explains why the
    approach was abandoned and what replaced it. A mapping of 2500+
    regional ingredient variations as observed in Indian labels. This
    dataset provides a structured mapping of the diverse ways
    ingredients are named on Indian food packaging, linking variants
    (the actual text found on labels) to a canon (a standardised, clean
    category). Example mapping: Canon: Acetic Acid (INS 260) — Variants:
    acidity regulator 260, vinegar, ins 260, acetic acid (260).}
}

For attribution, please cite this work as:
A R, Lalitha. 2026. “Indian Food Ingredients & Label Variants.” iSRL-26-02-DS-Variants. iSRL, February 1. https://doi.org/10.5281/zenodo.1871452.