A Q&A with R Systems’ AI Director Samiksha Mishra

Organizations are waking up to the fact that how they source data to train AI models is just as important as how the AI models themselves are developed. In fact, the data arguably is more important, which is why it’s critical to understand the entirety of the data supply chain backing your AI work. That’s the topic of a conversation we recently had with Samiksha Mishra, the director of AI at R Systems.

R Systems is an India-based provider of product engineering solutions, including data science and AI. As the director of AI, Mishra – who has a PhD in Artificial Intelligence and NLP from the Dr. A.P.J. Abdul Kalam Technical University in Lucknow, India – has a large influence on how the company helps clients position themselves for success with AI.

BigDATAwire recently conducted an email-based Q&A with Mishra on the topic of data supply chains. Here’s a lightly edited transcript of that conversation.

BigDATAwire: You’ve said that AI bias isn’t just a model problem but a “data supply chain” problem. Can you explain what you mean by that?

Samiksha Mishra: When I say that AI bias isn’t just a model problem but a data supply chain problem, I mean that harmful bias often enters systems before the model is trained.

Think of data as moving through a supply chain: it’s sourced, labeled, cleaned, transformed, and then fed into models. If bias enters early – through underrepresentation in data collection, skewed labeling, or feature engineering – it doesn’t just persist but multiplies as the data moves downstream. By the time the model is trained, bias is deeply entrenched, and fixes can only patch symptoms, not address the root cause.

Just like supply chains for physical goods need quality checks at every stage, AI systems need fairness validation points throughout the pipeline to prevent bias from becoming systemic.

BDW: Why do you think organizations tend to focus more on bias mitigation at the algorithm stage rather than earlier in the pipeline?

SM: Organizations often favor algorithm-level bias mitigation because it is efficient and practical to start with. It tends to be cheaper and faster to implement than a full overhaul of data pipelines. It also provides measurable and auditable fairness metrics that support governance and transparency. Additionally, this approach minimizes organizational upheaval, avoiding broad shifts in processes and infrastructure. However, researchers caution that data-level biases can still creep in, underscoring the need for ongoing monitoring and tuning.

BDW: At which stages of the AI data supply chain – acquisition, preprocessing, ingestion – are you seeing the most bias introduced?

SM: The most significant bias is found in the data collection stage. This is the foundational point where sampling bias (datasets not representative of the population) and historical bias (data reflecting societal inequities) are continually introduced. Because all subsequent stages operate on this initial data, any biases present here are amplified throughout the AI development process.

Data cleaning and preprocessing can introduce further bias through human judgment in labeling and feature selection, and data augmentation can reinforce existing patterns. Yet these issues are often a direct result of the foundational biases already present in the collected data. That’s why the acquisition stage is the primary entry point.

BDW: How can bias “multiply exponentially” as data moves through the supply chain?

SM: The key issue is that a small representational bias can be significantly amplified across the AI data supply chain due to reusability and interdependencies. When a biased dataset is reused, its initial flaw is propagated to multiple models and contexts. This is further magnified during preprocessing, as methods like feature scaling and augmentation can encode a biased feature into multiple new variables, effectively multiplying its weight.

Furthermore, bias is exacerbated by algorithms that prioritize overall accuracy, causing errors on minority groups to be overlooked.

Finally, the interconnected nature of the modern machine learning ecosystem means that a bias in one upstream component, such as a pretrained model or dataset, can cascade through the entire supply chain, amplifying its impact across diverse domains such as healthcare, hiring, and credit scoring.

BDW: What strategy do you recommend implementing from the moment data is sourced?

SM: If you want to keep AI bias from multiplying across the pipeline, the best strategy is to set up validation checkpoints from the very moment data is sourced. That means starting with distributional audits to check whether demographic groups are fairly represented and using tools like Skyline datasets to simulate coverage gaps.

During annotation and preprocessing, you must validate label quality with inter-annotator agreement metrics and strip out proxy features that can sneak in bias. At the training stage, models should optimize not just for accuracy but also fairness by including fairness terms in the loss function and monitoring subgroup performance. Before deployment, stress testing with counterfactuals and subgroup robustness checks helps catch hidden disparities. Finally, once the model is live, real-time fairness dashboards, dynamic auditing frameworks, and drift detectors keep the system honest over time.

In short, checkpoints at each stage – data, annotation, training, validation, and deployment – act like guardrails, ensuring fairness is continuously monitored rather than patched in at the end.
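
As an illustration of the training-stage checkpoint Mishra describes (a fairness term added to the loss function), here is a minimal sketch in PyTorch; the tensor names, the demographic-parity penalty, and the weighting factor are assumptions for the example, not R Systems' implementation.

```python
# A sketch of a fairness-regularized loss: binary cross-entropy plus a
# demographic-parity penalty. Assumes a binary classifier and a binary
# protected attribute, with both groups present in each batch.
import torch
import torch.nn.functional as F

def fairness_regularized_loss(logits, labels, group, lam=1.0):
    """logits: raw model outputs; labels: 0/1 targets; group: 0/1 protected attribute."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())

    probs = torch.sigmoid(logits)
    # Demographic-parity gap: difference in mean predicted positive rate between groups.
    dp_gap = (probs[group == 0].mean() - probs[group == 1].mean()).abs()

    # lam trades accuracy against fairness; tune it like any other hyperparameter.
    return bce + lam * dp_gap
```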

BDW: How can validation layers and bias filters be built into AI systems without compromising performance or speed?

SM: One effective way to integrate validation layers and bias filters into AI systems without sacrificing speed is to design them as lightweight checkpoints throughout the pipeline rather than heavy post-hoc add-ons. At the data stage, simple distributional checks such as χ² tests or KL-divergence can flag demographic imbalances at low computational cost. During training, fairness constraints can be embedded directly into the loss function so the optimizer balances accuracy and fairness simultaneously, rather than retraining models later. Research shows that such fairness-aware optimization adds minimal overhead while preventing biases from compounding.
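
For illustration, the data-stage check Mishra describes could be as small as the following SciPy snippet; the group counts and target representation shares are assumed for the example.

```python
# A lightweight distributional audit: flag a data batch whose demographic mix
# deviates from a target representation, using a chi-square test and KL divergence.
import numpy as np
from scipy.stats import chisquare, entropy

observed_counts = np.array([720, 180, 100])      # records per demographic group in the batch
expected_shares = np.array([0.60, 0.25, 0.15])   # target representation (illustrative)

expected_counts = expected_shares * observed_counts.sum()

chi2, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
kl_div = entropy(observed_counts / observed_counts.sum(), expected_shares)

# Flag the batch for review if the imbalance is statistically significant or the
# divergence exceeds a pipeline-specific threshold (0.05 here is arbitrary).
if p_value < 0.01 or kl_div > 0.05:
    print(f"Demographic imbalance flagged: chi2={chi2:.1f}, p={p_value:.4f}, KL={kl_div:.4f}")
```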

At validation and deployment, efficiency comes from parallelization and modularity. Fairness metrics like Equalized Odds or Demographic Parity can be computed in parallel with accuracy metrics, and bias filters can be structured as microservices or streaming monitors that check for drift incrementally. This means fairness audits run continuously without adding to prediction latency. By treating fairness as a set of modular, lightweight processes rather than afterthought patches, organizations can maintain both high performance and real-time responsiveness while ensuring models are equitable.
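
As a sketch of the streaming-monitor idea, the bias filter could be a small service that compares live per-group prediction rates against reference rates fixed at validation time; the class below, with its window size and tolerance, is an illustrative assumption rather than a production design.

```python
# A minimal streaming fairness monitor: keeps a rolling window of live predictions
# per demographic group and raises an alert when a group's positive-prediction rate
# drifts away from its validation-time reference rate.
from collections import defaultdict, deque

class FairnessDriftMonitor:
    def __init__(self, reference_rates, window=1000, tolerance=0.05):
        self.reference_rates = reference_rates               # e.g. {"A": 0.32, "B": 0.30}
        self.windows = defaultdict(lambda: deque(maxlen=window))
        self.tolerance = tolerance

    def observe(self, group, prediction):
        """Record one live 0/1 prediction for a group; return any drift alerts."""
        self.windows[group].append(prediction)
        alerts = []
        for g, preds in self.windows.items():
            ref = self.reference_rates.get(g)
            if ref is None or len(preds) < 100:              # need a reference and enough data
                continue
            live_rate = sum(preds) / len(preds)
            if abs(live_rate - ref) > self.tolerance:
                alerts.append((g, live_rate))
        return alerts

# Usage: monitor = FairnessDriftMonitor({"A": 0.32, "B": 0.30})
#        alerts = monitor.observe("A", 1)   # called on every live prediction
```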

BDW: How can a sandbox environment with more representative data help reduce bias?

SM: In human resources, recruitment platforms often train ranking algorithms on historical hiring data, which can reflect past gender imbalances. This introduces the risk of perpetuating bias in new hiring decisions. For instance, a model trained on data that historically favors male candidates in tech roles may learn to rank men higher, even when female candidates have equivalent qualifications.

A sandbox approach is often used to address challenges like this.

Before deployment, the hiring model is tested in an isolated, simulated environment. It is run against a synthetic dataset designed to be perfectly representative and balanced, with gender and other demographic attributes equally distributed and randomized across skill and experience levels.

Within this controlled setting, the model’s performance is measured using fairness metrics, such as Demographic Parity (ensuring equal selection rates across groups) and Equal Opportunity Difference (checking for equal true positive rates). If these metrics reveal a bias, mitigation strategies are applied. These may include reweighting features, using fairness-constrained optimization during training, or employing adversarial debiasing techniques to reduce the model’s reliance on protected attributes.

This pre-deployment validation ensures the system is calibrated for fairness under representative conditions, reducing the risk of biased historical data distorting real-world hiring outcomes.
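
A minimal version of that sandbox check might look like the snippet below; the synthetic balanced candidates and the deliberately skewed stand-in model are fabricated for illustration and are not a real hiring system.

```python
# Sandbox fairness check on synthetic, balanced candidate data: compute the
# demographic parity difference and the equal opportunity difference for a
# stand-in shortlisting model before it would be allowed to leave the sandbox.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, n)            # balanced protected attribute (0/1)
skill = rng.normal(0.0, 1.0, n)           # skill independent of gender by construction
qualified = (skill > 0.5).astype(int)     # ground truth: who should be shortlisted

def predict(skill, gender):
    """Stand-in for the trained ranking model under audit (deliberately biased)."""
    return ((skill + 0.3 * gender) > 0.5).astype(int)

selected = predict(skill, gender)

# Demographic parity difference: gap in selection rates between groups.
dp_diff = selected[gender == 1].mean() - selected[gender == 0].mean()

# Equal opportunity difference: gap in true positive rates among qualified candidates.
def tpr(g):
    mask = (gender == g) & (qualified == 1)
    return selected[mask].mean()

eo_diff = tpr(1) - tpr(0)

print(f"Demographic parity difference: {dp_diff:+.3f}")
print(f"Equal opportunity difference:  {eo_diff:+.3f}")
# Nonzero gaps would trigger mitigation (reweighting, fairness-constrained training,
# adversarial debiasing) before real-world deployment.
```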

BDW: What are the biggest obstacles preventing companies from adopting a supply chain approach to bias mitigation?

SM: Organizations prefer to implement algorithmic fairness metrics (e.g., Equalized Odds, Demographic Parity) because they are easier to apply late in the pipeline. This narrow approach ignores how compounded bias in data preparation already constrains fairness outcomes.

Organizations also often prioritize short-term efficiency and innovation speed over embedding ethical checkpoints at every stage of the AI pipeline. This leads to fragmented accountability, where bias in data sourcing or preprocessing is overlooked because responsibility is pushed downstream to algorithm developers.

BDW: Are there specific industries where this approach is especially urgent or where the consequences of biased AI outputs are most severe?

SM: In addition to human resources, as I mentioned earlier, biased AI outputs are most severe in high-stakes industries such as healthcare, finance, criminal justice, and education, where decisions directly impact people’s lives and opportunities.

In healthcare, specifically, biased diagnostic algorithms risk exacerbating health disparities by misclassifying conditions in underrepresented populations.

Financial systems face similar challenges, as machine learning models used in credit scoring can reproduce historical discrimination, systematically denying loans to minority groups.

These examples demonstrate that adopting a supply chain approach to bias mitigation is most urgent in sectors where algorithmic bias translates into inequity, harm, and systemic discrimination.

BDW: What’s one change companies could make today that would have the biggest impact on reducing bias in their AI systems long-term?

SM: I believe there are two changes that organizations can make today that will have a tremendous impact on reducing bias.

First, they should establish a diverse, interdisciplinary team with a mandate for ethical AI development and oversight. While technical solutions like using diverse datasets, fairness-aware algorithms, and continuous monitoring are crucial, they are often reactive or can miss biases that only a human perspective can identify. A diverse, interdisciplinary team tackles the problem at its root – the people and processes that build the AI.

Second, organizations should begin treating data governance as a first-class discipline, on par with model development. That means establishing rigorous processes for sourcing, documenting, and validating datasets before they ever reach the training pipeline. By implementing standardized practices like datasheets for datasets or model cards and requiring demographic balance checks at the point of collection, organizations can prevent the majority of bias from entering the system in the first place.

Once biased data flows into model training, later algorithmic fixes can only partially compensate. Strong governance at the data layer, by contrast, creates a foundation for fairness that compounds over time.

Both of these solutions are organizational and cultural changes that establish a solid foundation, ensuring all other technical and process improvements are effective and sustainable over the long term.

BDW: Thank you for your insights on data bias and supply chain problems.

Related Items:

Data Quality Is A Mess, But GenAI Can Help

Data Quality Getting Worse, Report Says

Kinks in the Data Supply Chain
