<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Data and AI blog]]></title><description><![CDATA[Blog about big data and AI with quick and fun reads! Expect quick and engaging insights, playful explorations of complex topics, and the occasional demos.]]></description><link>https://blog.akashja.in</link><generator>RSS for Node</generator><lastBuildDate>Wed, 22 Apr 2026 09:01:25 GMT</lastBuildDate><atom:link href="https://blog.akashja.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Data Virtualisation: Reshaping How Enterprises Access Data]]></title><description><![CDATA[Data has always been the lifeblood of enterprise decision-making. But for decades, getting that data to the right person, at the right time, in the right format has been a costly, fragile, and frustra]]></description><link>https://blog.akashja.in/data-virtualisation-reshaping-how-enterprises-access-data</link><guid isPermaLink="true">https://blog.akashja.in/data-virtualisation-reshaping-how-enterprises-access-data</guid><category><![CDATA[DataVisualization]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[#DataArchitecture]]></category><category><![CDATA[ETL]]></category><category><![CDATA[denodo]]></category><category><![CDATA[#Data Fabric Architecture]]></category><category><![CDATA[DataIntegration]]></category><dc:creator><![CDATA[Akash Jain]]></dc:creator><pubDate>Sun, 12 Apr 2026 18:41:19 GMT</pubDate><content:encoded><![CDATA[<p>Data has always been the lifeblood of enterprise decision-making. But for decades, getting that data to the right person, at the right time, in the right format has been a costly, fragile, and frustrating endeavour. Extract, transform, load. Wait. Repeat. While ETL pipelines quietly chugged away in the background, businesses were left looking at yesterday's data trying to make tomorrow's decisions.</p>
<p>Data virtualisation challenges that status quo — fundamentally. And the industry is finally paying attention.</p>
<hr />
<h2>1. History of Data Virtualisation: The Journey Until Now</h2>
<p>The story of data virtualisation didn’t begin with a grand architectural vision. It began the way most enterprise innovations do — with a pile of problems and a growing sense of “there has to be a better way.”</p>
<p><strong>The 1990s: The Federated Query Era</strong></p>
<p>In the early 1990s, as relational databases proliferated across enterprises, the need to query across multiple systems simultaneously became apparent. <strong>IBM's federated database technology, built into DB2, was one of the earliest attempts at this</strong>. You could write a single SQL query that would fan out to multiple databases and bring the results back together. It was clunky, limited, and slow — but the seed of an idea had been planted.</p>
<p><strong>Early 2000s: Enterprise Information Integration (EII)</strong></p>
<p>By the early 2000s, computing power had grown significantly, and vendors began positioning federated query capabilities as a broader category: <strong>Enterprise Information Integration</strong>, or <strong>EII</strong>. The term was first coined by <em>MetaMatrix</em> and represented a fundamental shift — rather than physically moving data, you would create a virtual layer that made disparate sources look like a single, unified database. <a href="https://www.denodo.com/en">Denodo</a> released its very first version (v1.0) in 2002, built from the ground up around this concept.</p>
<p>The appeal was obvious: no data replication, no staging tables, no brittle ETL pipelines. Just a logical layer that abstracted the complexity underneath.</p>
<p><strong>2005–2015: The Rise of the Data Warehouse — and the Cracks</strong></p>
<p>During the same period, the data warehouse was at its peak. Teradata, Netezza, and SQL Server became the backbone of enterprise analytics. But warehouses had a problem: they required everything to be moved, transformed, and loaded before it could be queried. As the volume and variety of data exploded — driven by digital transformation, mobile, and the early SaaS wave — the ETL bottleneck became impossible to ignore.</p>
<p>Hadoop arrived around 2008 as a response to the scale problem. But Hadoop was complex, opaque, and slow for interactive analytics. The result? Enterprises ended up with a patchwork: warehouses, data lakes, operational databases, and dozens of SaaS platforms — all siloed, all requiring their own integration pipelines. This fragmentation didn't just create a data problem — it created an industry. ETL tools, integration middleware, and data pipeline vendors collectively grew into a multi-billion dollar category built almost entirely around the cost of moving data that probably shouldn't have needed moving.</p>
<p>This fragmentation created exactly the conditions data virtualisation was built for.</p>
<p><strong>2015–2020: Data Virtualisation Goes Mainstream</strong></p>
<p>As cloud computing matured and the microservices revolution dispersed data even further across organisations, virtualisation platforms like <a href="https://www.denodo.com/en"><strong>Denodo</strong></a>, <a href="https://www.tibco.com/glossary/what-is-data-virtualization#:~:text=Data%20virtualization%20software%20connects%20multiple%20data%20sources,Hundreds%20of%20projects%20*%20Thousands%20of%20users"><strong>TIBCO Data Virtualization</strong></a>, and <a href="https://www.ibm.com/products/watson-query"><strong>IBM Data Virtualisation</strong></a> began gaining serious enterprise traction. They were no longer niche EII tools — they were positioned as semantic layers and data fabric enablers, providing unified access, governance, lineage, and security across the entire data estate.</p>
<p>The concept of the <a href="https://www.gartner.com/en/data-analytics/topics/data-fabric"><strong>Data Fabric</strong></a> — which Gartner began evangelising heavily from 2019 onwards — placed data virtualisation at its architectural heart.</p>
<p><strong>2020–Present: The Lakehouse Era and a New Question</strong></p>
<p>The emergence of the lakehouse (Databricks, Snowflake, Apache Iceberg) and distributed SQL engines (<a href="https://trino.io">Trino</a>, Starburst, <a href="https://duckdb.org">DuckDB</a>) triggered a new architectural reassessment. Suddenly, you had yet another powerful physical data store, yet another query engine — and organisations were asking: do we still need virtualisation?</p>
<p>The answer, as we will explore, is nuanced. But what is clear is that data virtualisation has evolved from a niche integration tool into a foundational enterprise architecture pattern. The journey has taken it from clunky federated SQL in the 1990s to a sophisticated, AI-ready logical data management layer in 2026.</p>
<hr />
<h2>2. What Is Data Virtualisation?</h2>
<p>At its core, data virtualisation is deceptively simple: it creates a <strong>unified, virtual layer</strong> over disparate data sources — without physically moving or replicating the data.</p>
<p>Think of it like a universal remote control. Your TV, sound system, streaming box, and gaming console all have their own interfaces and protocols. A universal remote doesn't replace any of them — it sits on top and lets you control everything through one interface. Data virtualisation does the same for your data sources.</p>
<p>When a business user or application sends a query to a data virtualisation platform, the platform:</p>
<ol>
<li><p><strong>Intercepts</strong> the query at the virtual layer</p>
</li>
<li><p><strong>Translates</strong> it into the native language of each relevant source (SQL for databases, REST calls for APIs, SPARQL for graphs, etc.)</p>
</li>
<li><p><strong>Executes</strong> the query in parallel across sources using pushdown optimisation</p>
</li>
<li><p><strong>Federates</strong> the results back and presents a unified response</p>
</li>
</ol>
<p>The result? A user in a BI tool sees a clean, business-friendly view of "Customer 360" — without needing to know that the data lives across Salesforce, SAP, an Oracle database, and three REST APIs.</p>
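<p>To make those steps concrete, here is a minimal sketch of a federated query using the Python client for <a href="https://trino.io">Trino</a>, one of the distributed SQL engines mentioned earlier. It assumes a cluster that already has a <code>postgresql</code> and a <code>mysql</code> catalog configured; the host, catalog, schema, and table names are illustrative.</p>
<pre><code class="lang-python">import trino  # pip install trino

# Connect to the federation layer, not to any individual source system.
conn = trino.dbapi.connect(
    host="trino.internal.example.com",  # illustrative host
    port=8080,
    user="analyst",
    catalog="postgresql",
    schema="public",
)

cur = conn.cursor()
# One logical query: the engine pushes filters down to each source,
# executes the pieces in parallel, and federates the results.
cur.execute("""
    SELECT c.customer_id, c.segment, SUM(o.amount) AS lifetime_value
    FROM postgresql.public.customers AS c
    JOIN mysql.sales.orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_date &gt;= DATE '2025-01-01'
    GROUP BY c.customer_id, c.segment
""")
for customer_id, segment, lifetime_value in cur.fetchall():
    print(customer_id, segment, lifetime_value)
</code></pre>
<p>The consumer never needs to know which rows came from Postgres and which came from MySQL; that is exactly the abstraction a full virtualisation platform generalises across dozens of source types.</p>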
<p><strong>What makes modern data virtualisation different from old federated query?</strong></p>
<p>Modern platforms go far beyond simple query federation. They include:</p>
<ul>
<li><p><strong>Semantic modelling</strong> — business-friendly naming, calculated metrics, and reusable logical entities</p>
</li>
<li><p><strong>Row- and column-level security</strong> — fine-grained access control enforced at the virtual layer</p>
</li>
<li><p><strong>Intelligent caching</strong> — frequently accessed results are materialised in-memory or on fast storage, so not every query hits the source</p>
</li>
<li><p><strong>Data lineage and cataloguing</strong> — full visibility into where data comes from and how it flows</p>
</li>
<li><p><strong>REST and GraphQL APIs</strong> — so data isn't just queryable via SQL but also accessible to applications and AI agents (see the sketch after this list)</p>
</li>
<li><p><strong>Active metadata</strong> — using AI and ML to automatically suggest optimisations, detect anomalies, and enrich the semantic layer</p>
</li>
</ul>
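<p>As a quick illustration of the API point above, here is a minimal sketch of an application (or AI agent) reading a published virtual view over REST. The endpoint, query parameters, and response shape are purely hypothetical; every platform exposes this differently.</p>
<pre><code class="lang-python">import requests  # pip install requests

# Hypothetical REST endpoint for a published "customer_360" virtual view.
VIEW_URL = "https://dv.example.com/api/views/customer_360"

response = requests.get(
    VIEW_URL,
    params={"segment": "enterprise", "limit": 50},          # illustrative parameters
    headers={"Authorization": "Bearer &lt;access-token&gt;"},      # illustrative auth
    timeout=30,
)
response.raise_for_status()

for row in response.json()["rows"]:  # hypothetical response shape
    print(row["customer_id"], row["lifetime_value"])
</code></pre>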
<p><strong>Data virtualisation is not anti-ETL</strong>. It's an alternative integration pattern — one that favours real-time, governed, logical access over physical data movement.</p>
<hr />
<h2>3. Industry Trends and the Importance of Data Virtualisation</h2>
<p>The market numbers tell a compelling story.</p>
<p>The global data virtualisation market is projected to grow at a compound annual growth rate (CAGR) of over <strong>20%</strong>, potentially reaching <strong>USD 22.83 billion by 2032</strong>. North America currently holds the largest market share, while Asia Pacific is the fastest-growing region.</p>
<p>But beyond the numbers, four structural forces are making data virtualisation increasingly important:</p>
<p><strong>1. The Proliferation of SaaS and Cloud</strong></p>
<p>The average enterprise now uses hundreds of SaaS applications — Salesforce, Workday, ServiceNow, NetSuite, HubSpot. Each of these is a silo. ETL pipelines into a central warehouse were workable when you had 10 systems; they become a maintenance nightmare at 200. Data virtualisation provides a live, unified access layer across SaaS estates without the replication overhead.</p>
<p><strong>2. The Rise of AI and Agentic Architectures</strong></p>
<p>AI agents are increasingly being used to query data in real time for reasoning and decision-making — not batch-processing it for training. This use case plays directly to virtualisation's strengths: fresh, governed, real-time data access without stale warehouse snapshots. As agentic AI matures, virtualisation-first architectures become even more attractive.</p>
<p><strong>3. Data Mesh and Distributed Ownership</strong></p>
<p>The data mesh paradigm — where data ownership is distributed to domain teams — creates a governance challenge: how do you maintain a unified consumer experience when data is produced and owned by many different teams? Data virtualisation provides the answer: a federated semantic layer that abstracts the distributed reality while presenting a coherent interface to consumers.</p>
<p><strong>4. Regulatory and Data Sovereignty Pressures</strong></p>
<p>Regulations like GDPR, CCPA, and emerging AI governance frameworks require organisations to know exactly where sensitive data lives and who can access it. Data virtualisation, with its centralised policy enforcement and lineage capabilities, makes compliance significantly more tractable than managing it across dozens of physical copies of data.</p>
<hr />
<h2>4. Data Virtualisation and ETL — Why You Almost Certainly Need Both</h2>
<p>Let me be direct about something that a lot of vendor marketing conveniently glosses over: <strong>data virtualisation alone is not sufficient for complex enterprise analytics workloads</strong>. I know that's not the headline anyone building a Denodo pitch deck wants to read, but it is the truth — and understanding <em>why</em> is what lets you position the right solution for your environment.</p>
<p>The question is not "virtualisation or ETL?" The question is "which jobs belong to which tool?" And once you see the distinction clearly, the answer becomes obvious.</p>
<p><strong>Where ETL and ELT remain non-negotiable</strong></p>
<p>There is a category of analytical work where the output needs to exist as a persistent, pre-computed artefact — and no amount of clever query federation gets you out of that. Specifically:</p>
<p><strong>Aggregations and enriched business views.</strong> Consider a <code>SUM(revenue) GROUP BY region, product, week</code> across five billion transaction rows, being queried by 300 concurrent BI users. You cannot push that computation back to the source system at query time — you will simply bring it down. The aggregate needs to be materialised somewhere physical. Data virtualisation can serve that result beautifully, but ETL produced it. The two are not competing here — they are sequential.</p>
<p>More importantly: enriched business views are business assets. When your data team has spent weeks joining customer records with transaction history, applying lifetime value formulas, and building a canonical "Customer 360" that finance, marketing, and operations all agree on — that view should be materialised and governed as a first-class artefact. It is not a query you want to recompute from scratch every time someone opens a dashboard.</p>
<p><strong>Windowing and ranking functions at scale.</strong> <code>LAG()</code>, <code>LEAD()</code>, <code>RANK() OVER(PARTITION BY ...)</code>, rolling 30-day averages — these require the engine to hold a full ordered partition in memory to compute correctly. Virtualisation engines can execute these, but they do so by pulling raw data across the wire first. At scale, pre-computing these via ETL in a lakehouse and serving the results via virtualisation is the sensible architecture.</p>
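<p>As a minimal sketch of that pre-compute-then-serve pattern, here is a PySpark job that materialises a rolling 30-day average once in the lakehouse, so the virtualisation layer only has to serve the finished column. Table names are illustrative.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("rolling-30d-average").getOrCreate()

txns = spark.table("lake.raw.transactions")  # illustrative raw table

# 30-day rolling window per customer, expressed in seconds for rangeBetween.
w = (
    Window.partitionBy("customer_id")
    .orderBy(F.col("txn_date").cast("timestamp").cast("long"))
    .rangeBetween(-30 * 86400, 0)
)

enriched = txns.withColumn("rolling_30d_avg", F.avg("amount").over(w))

# Materialise once; dashboards and the virtual layer read this table,
# not the raw transactions.
enriched.write.mode("overwrite").saveAsTable("lake.curated.txn_rolling_30d")
</code></pre>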
<p><strong>Historical depth.</strong> Source systems purge or archive data. Your operational CRM does not store seven years of customer interactions. Your OLTP database does not retain every version of every record. A data warehouse or lakehouse is often the only place where that history lives in a queryable form — and virtualisation cannot conjure data that does not exist at the source.</p>
<p><strong>ML and AI model training.</strong> Training a machine learning model requires repeated full scans over terabytes of data, shuffled and batched, co-located with compute. This is architecturally incompatible with query-time federation. You need the data physically present. Full stop.</p>
<p><strong>Data quality, deduplication, and entity resolution.</strong> Identifying that "Akash Jain", "A. Jain", and "AJ" in three different systems are the same person is computationally expensive and produces a corrected record that needs to persist. You cannot fuzzy-match your way to a golden customer record on the fly at query time — at least not without making your users regret opening the dashboard.</p>
<p><strong>So where does data virtualisation genuinely excel?</strong></p>
<p>Virtualisation is the right answer — often the <em>best</em> answer — for:</p>
<ul>
<li><p><strong>Operational and real-time reporting</strong> where data freshness matters more than query performance</p>
</li>
<li><p><strong>Self-service federated queries</strong> by analysts exploring data before it has been formally modelled</p>
</li>
<li><p><strong>Regulatory and audit reporting</strong> that requires raw records from source systems with full lineage</p>
</li>
<li><p><strong>Governed API-driven data products</strong> exposed to applications and AI agents</p>
</li>
<li><p><strong>The semantic and governance layer</strong> that sits above your lakehouse, enforcing business definitions and access controls across everything</p>
</li>
</ul>
<p><strong>The architecture that actually works</strong></p>
<p>In any enterprise with real analytical complexity, the honest architecture looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/668910a7dcf7d5e83ddab66e/00859f3b-def0-40e4-ac38-321feefaeac3.png" alt="" style="display:block;margin:0 auto" />

<p>Data virtualisation does not replace the enrichment step — it sits above it, providing a governed, unified interface to the outputs of that enrichment. In this model, the lakehouse does not disappear — but it becomes a <strong>derived performance tier</strong>, not the source of truth. The virtualisation layer holds the semantic model, the governance rules, and the business logic that makes data actually mean something.</p>
<p>The right principle: <em>virtualise first to explore and to serve; materialise when the workload demands it or when the business view is worth preserving as a managed asset.</em></p>
<p>For simpler deployments — a mid-size organisation, mostly SaaS data sources, modest query volumes, no ML ambitions — virtualisation alone may be genuinely sufficient. But if your analytics environment involves enrichment, windowing, high concurrency, or anything a data scientist would call "interesting", you need ETL in the stack.</p>
<hr />
<h2>5. Why Data Virtualisation?</h2>
<p>Beyond the technical arguments, there are compelling business and operational reasons to invest in data virtualisation.</p>
<p><strong>Speed to insight</strong></p>
<p>Traditional ETL pipelines can take weeks or months to build for a new data source. Data virtualisation can connect a new source and expose it through the semantic layer in hours. For business teams waiting on new data to make decisions, this is transformational.</p>
<p><strong>Reduced data redundancy and storage costs</strong></p>
<p>Every physical copy of data has a cost — storage, compute, maintenance, and the hidden cost of keeping it synchronised. Virtualisation eliminates unnecessary copies, reducing storage overhead and the operational burden of keeping multiple systems in sync.</p>
<p><strong>Single source of truth for business definitions</strong></p>
<p>One of the most insidious problems in large organisations is metric inconsistency — the sales team's definition of "revenue" differs from finance's, which differs from the CFO dashboard's. Data virtualisation enforces a single semantic layer where business metrics, entities, and hierarchies are defined once and consumed everywhere.</p>
<p><strong>Accelerating data democratisation</strong></p>
<p>By abstracting the technical complexity of source systems, virtualisation allows data teams to publish clean, governed, business-friendly data products that non-technical users can access directly in their BI tools, notebooks, or AI assistants — without requiring SQL expertise or knowledge of underlying schemas.</p>
<p><strong>Supporting data fabric and mesh architectures</strong></p>
<p>Data virtualisation is the connective tissue of the modern data fabric. It enables the federated governance model that data mesh requires, providing centralised policy enforcement without centralised data storage.</p>
<p><strong>Resilience and agility</strong></p>
<p>When a source system is migrated, upgraded, or replaced, a virtualisation layer acts as an abstraction barrier — consumers of data don't need to change their queries or reports. Only the underlying connector is updated. This architectural resilience can save significant rework costs during technology transitions.</p>
<hr />
<h2>6. Anti-Pattern: Using Your Virtualisation Layer as a Compute Engine</h2>
<p>This is one of the most expensive mistakes in enterprise data virtualisation deployments — and it's surprisingly common. The scenario looks like this. A team sets up a data virtualisation platform over a data lake, connects it to their BI tools, and starts building views. Business users begin running reports. Everything works. Then the bills arrive.</p>
<p>What's actually happening under the hood is this: every time a dashboard loads or a report refreshes, the virtualisation layer is reaching into the raw data lake, scanning billions of rows, and computing aggregations — GROUP BY region, SUM(revenue), COUNT(DISTINCT customer_id) — from scratch. Every. Single. Time. The same computation, repeated on every query, for every user, on every refresh cycle.</p>
<p>This is not what a virtualisation layer is for. It is a semantic and governance layer — designed to serve pre-structured data assets efficiently, not to replace the transformation layer that should have built those assets in the first place.</p>
<p>Virtualisation platforms are typically licensed or priced on data volume processed, query concurrency, or compute units consumed. When every query is scanning raw tables and computing aggregations on the fly, all of those metrics spike — often dramatically. You end up paying virtualisation-layer rates for work that should have been done once, stored, and served cheaply.</p>
<p>The deeper irony is that this pattern hurts twice. The underlying data lake absorbs heavy compute load at query time instead of during scheduled batch processing. And the virtualisation platform consumes capacity doing work it was never designed to do repeatedly. Neither system is being used correctly, and you are paying for both inefficiencies simultaneously.</p>
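<p>A minimal sketch of the fix, assuming a Spark-based lakehouse underneath (table names are illustrative): run the heavy aggregation once on a schedule, persist the small summary, and point the virtual view at that summary instead of the raw data.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weekly-revenue-rollup").getOrCreate()

# The anti-pattern is effectively re-running this scan of billions of raw rows
# on every dashboard refresh. The fix: run it once per schedule, store it, and
# let the virtualisation layer serve the result.
summary = (
    spark.table("lake.raw.transactions")  # illustrative raw table
    .groupBy("region", "product", F.date_trunc("week", "txn_date").alias("week"))
    .agg(
        F.sum("revenue").alias("revenue"),
        F.countDistinct("customer_id").alias("customers"),
    )
)

summary.write.mode("overwrite").saveAsTable("lake.curated.weekly_revenue")
</code></pre>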
<hr />
<h2>7. The Leaders in the Space: Gartner Says ...</h2>
<p>Gartner evaluates the data virtualisation landscape primarily through its <strong>Magic Quadrant for Data Integration Tools</strong> — reflecting the reality that virtualisation has become deeply integrated with the broader data integration market rather than existing as a standalone category.</p>
<p>The December 2024 edition of the report evaluated 20 vendors, and the Leaders quadrant tells a revealing story about where the market has matured.</p>
<p><strong>Denodo</strong> — the most specialised data virtualisation vendor in the quadrant — was recognised as a Leader for the <strong>fifth consecutive year</strong>. Denodo's strength lies in its depth: enterprise-grade semantic modelling, fine-grained security, extensive connector library, and a strong data fabric story. It remains the benchmark against which all other virtualisation capabilities are measured.</p>
<p><strong>IBM</strong> — a stalwart of enterprise data integration — has been named a Leader for an extraordinary <strong>20 consecutive years</strong> (as of 2025). IBM's DataStage and Cloud Pak for Data suite incorporate strong virtualisation capabilities, particularly valued by organisations already deep in the IBM ecosystem.</p>
<p><strong>Microsoft</strong> was recognised as a Leader in the 2024 report, driven by Microsoft Fabric's unified analytics platform which incorporates virtualisation through its OneLake architecture and cross-source query capabilities.</p>
<p><strong>TIBCO</strong> (now part of the Cloud Software Group) has historically been a strong player in this space with TIBCO Data Virtualization, serving large enterprises with complex integration requirements.</p>
<p>Beyond the Magic Quadrant, several newer entrants are reshaping the landscape:</p>
<ul>
<li><p><strong>Starburst</strong> (built on Trino) is becoming a popular open-source-based alternative for organisations wanting distributed SQL federation without a proprietary platform</p>
</li>
<li><p><strong>Dremio</strong> positions itself as a lakehouse-native virtualisation and acceleration layer</p>
</li>
<li><p><strong>Databricks Unity Catalog</strong> is increasingly incorporating data sharing and virtualisation capabilities natively within the lakehouse context</p>
</li>
</ul>
<p>What Gartner's research consistently highlights is that the distinction between "data virtualisation" and "data integration" is dissolving. The future belongs to platforms that do both — enabling logical access and physical integration as context demands, governed by a unified semantic and metadata layer.</p>
<hr />
<h2>8. Conclusion: The Architecture Is Finally Catching Up to the Problem</h2>
<p>For years, enterprise data architecture was built around a simple assumption: to use data, you must first move it. That assumption created an entire industry — ETL tools, data warehouses, replication pipelines — and a generation of data engineers whose primary job was plumbing.</p>
<p>Data virtualisation challenges that assumption at its foundation. It says: what if you didn't have to move the data? What if the semantic layer <em>was</em> the architecture?</p>
<p>The honest answer is that virtualisation alone cannot do everything. For complex enterprises — the ones dealing with enriched business views, high-concurrency dashboards, windowed aggregations, and ML workloads — ETL remains load-bearing, not legacy. The goal is not to replace it but to be precise about which layer does which job. Virtualise what needs to be fresh and flexible. Materialise what needs to be fast, enriched, and preserved.</p>
<p>But for the majority of query patterns — operational reporting, data product APIs, governed self-service analytics, and real-time AI agent queries — virtualisation is not just viable. It is often the <em>better</em> choice.</p>
<p>The market is converging on a hybrid model: virtualisation as the authoritative semantic and governance layer, physical storage as a derived performance tier. In this model, the lakehouse doesn't disappear — it becomes a cache you could rebuild at any time. The virtualisation layer becomes the thing you couldn't rebuild: the accumulated business logic, governance rules, and semantic definitions that make data actually mean something.</p>
<p>As agentic AI matures, as data mesh adoption grows, and as enterprises continue to accumulate SaaS sprawl, the case for virtualisation-first architecture will only strengthen.</p>
<hr />
<p><em>If this resonated with you, I'd love to hear your thoughts — especially if you're working through data virtualisation decisions in your own organisation. Drop a comment or connect with me on</em> <a href="https://www.linkedin.com/in/akashjain0802/"><em>LinkedIn</em></a><em>.</em></p>
<hr />
<p><em>Sources referenced in this article:</em></p>
<ul>
<li><p><a href="https://www.denodo.com/en/press-release/2024-12-09/denodo-named-leader-2024-gartner-magic-quadrant-data-integration-tools-five-consecutive-years">Denodo Named a Leader in 2024 Gartner® Magic Quadrant™ for Data Integration Tools</a></p>
</li>
<li><p><a href="https://www.ibm.com/new/announcements/ibm-named-a-leader-in-the-2025-gartner-magic-quadrant-for-data-integration-tools-for-the-20th-consecutive-year">IBM Named a Leader in the 2025 Gartner Magic Quadrant for Data Integration Tools</a></p>
</li>
<li><p><a href="https://www.researchandmarkets.com/report/data-virtualization">Data Virtualization Market Size &amp; Forecast 2025–2032</a></p>
</li>
<li><p><a href="https://tdwi.org/blogs/tdwi-blog/2010/06/the-evolution-of-data-federation.aspx">The Evolution of Data Federation – TDWI</a></p>
</li>
<li><p><a href="https://www.ibm.com/think/insights/data-virtualization-data-lake">Data Virtualization: The Evolution of the Data Lake – IBM</a></p>
</li>
<li><p><a href="https://www.gartner.com/reviews/market/data-virtualization">Gartner Peer Insights: Data Virtualization</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Would you let the LLM take the wheel?]]></title><description><![CDATA[A practical, friendly guide to User-Orchestrated vs LLM-Orchestrated agent workflows
Agents are a bit like actors on a stage. Somebody — or something — has to call “Action!”, cue the right lines, and decide when the show ends. In modern AI architectu...]]></description><link>https://blog.akashja.in/would-you-let-the-llm-take-the-wheel</link><guid isPermaLink="true">https://blog.akashja.in/would-you-let-the-llm-take-the-wheel</guid><category><![CDATA[ai agents]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[OpenAI Agents SDK]]></category><category><![CDATA[#anthropic]]></category><category><![CDATA[AI Agent Development]]></category><dc:creator><![CDATA[Akash Jain]]></dc:creator><pubDate>Thu, 09 Oct 2025 08:44:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759992922953/b4046267-231f-471e-bb15-aaaf1afd74bb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-a-practical-friendly-guide-to-user-orchestrated-vs-llm-orchestrated-agent-workflows"><strong>A practical, friendly guide to User-Orchestrated vs LLM-Orchestrated agent workflows</strong></h3>
<p>Agents are a bit like actors on a stage. Somebody — or something — has to call “Action!”, cue the right lines, and decide when the show ends. In modern AI architectures that somebody can be the human developer, or the LLM itself. That choice — who orchestrates — shapes reliability, transparency, cost, and the kinds of problems your system can solve (or create 😀).</p>
<p>This blog explains the two orchestration patterns, shows how they behave using two simple projects I built (user-orchestrated vs LLM-orchestrated flows), walks through what the LLM is “thinking” when it calls other agents, and gives clear, real-world guidance on when to pick each pattern. Along the way, it becomes apparent why absolute autonomy is not always a good thing.</p>
<hr />
<h1 id="heading-two-orchestration-patterns-the-short-version">Two orchestration patterns (the short version)</h1>
<p><strong>1) User-Orchestrated Flow</strong> — the human (or program) is the conductor.<br />You explicitly call agents in a fixed sequence: “Call A, then B, then C.” Great when you need deterministic, auditable pipelines. Think compliance reports, billing, or anything where skipping a step is unacceptable.</p>
<p><strong>2) LLM-Orchestrated Flow</strong> — the LLM is the conductor. An agent has tools (which can be plain functions or <em>other agents</em>) and dynamically decides which to call and when. This shines for adaptive, exploratory tasks — troubleshooting, creative ideation, diagnostics — where the correct flow depends on content, context, and nuance.</p>
<hr />
<h1 id="heading-how-these-patterns-look-in-practice">How these patterns look in practice</h1>
<p>I built two <a target="_blank" href="https://github.com/openai/openai-agents-python">OpenAI Agents SDK</a> projects using Python to explore the two different ways of orchestrating the flow.</p>
<p>The project is a trivial one, but it helps illustrate the concept: there are three sales agents, each with a different personality type (Humorous / Serious / Professional). Each agent generates a sales email draft. A fourth agent takes the three drafts as input, chooses the single best draft, and sends the email.</p>
<h3 id="heading-project-a-user-orchestrated-flow">Project A: User-Orchestrated flow</h3>
<ul>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759929065476/61d4008f-ace5-45eb-957c-ce25b5c5e36f.png" alt class="image--center mx-auto" /></p>
<p>  The Python code calls three sales agent functions in parallel (Humorous / Serious / Professional).</p>
</li>
<li><p>The host script collects their outputs and passes them into a Picker agent.</p>
</li>
<li><p>The Picker outputs a selection and <code>function_call</code>; your code then calls <code>send_email</code> (sketched in the code below).</p>
</li>
</ul>
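<p>A condensed sketch of this host-driven sequence (not the exact project code), assuming the OpenAI Agents SDK's <code>Agent</code> and <code>Runner</code> primitives; the names, prompts, and <code>send_email</code> stand-in are illustrative.</p>
<pre><code class="lang-python">import asyncio
from agents import Agent, Runner  # pip install openai-agents

def make_sales_agent(style: str) -&gt; Agent:
    return Agent(
        name=f"{style} sales agent",
        instructions=f"Write a cold sales email in a {style} tone.",
    )

picker = Agent(
    name="Picker",
    instructions="You are given several drafts. Reply with the single best draft, verbatim.",
)

def send_email(body: str) -&gt; None:
    """Illustrative stand-in for the real email-sending function."""
    print("Sending:\n", body)

async def main() -&gt; None:
    # Host code is the conductor: call the three sales agents in parallel...
    drafts = await asyncio.gather(*[
        Runner.run(make_sales_agent(style), "Draft a cold outreach email for our product.")
        for style in ("humorous", "serious", "professional")
    ])
    # ...collect their outputs, pass them to the Picker, then send the winner ourselves.
    combined = "\n\n---\n\n".join(d.final_output for d in drafts)
    picked = await Runner.run(picker, f"Here are the drafts:\n\n{combined}")
    send_email(picked.final_output)

asyncio.run(main())
</code></pre>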
<p><strong>Properties</strong></p>
<ul>
<li><p>Flow is explicit, predictable, and easy to audit.</p>
</li>
<li><p>The orchestration logic sits in code (not the LLM).</p>
</li>
<li><p>Traces show a linear, developer-specified sequence of LLM calls.</p>
</li>
</ul>
<h3 id="heading-project-b-llm-orchestrated-flow">Project B: LLM-Orchestrated flow</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759929217401/e3b9ea19-35d7-4ce3-a475-3e3163d13d59.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>The Picker Agent is given the three sales <strong>agents as tools</strong> (using the OpenAI Agent <a target="_blank" href="https://openai.github.io/openai-agents-python/ref/agent/#agents.agent.Agent.as_tool"><code>agent.as_tool()</code></a> API).</p>
</li>
<li><p>You call the Picker once. Inside its reasoning, it decides when and which sales agents to call, evaluates outputs, and decides whether to call <code>send_email</code> (see the sketch below).</p>
</li>
<li><p>Traces show nested and dynamic LLM calls: the Picker’s call includes sub-calls to the sales agents and the final <code>send_email</code> invocation.</p>
</li>
</ul>
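<p>And the equivalent sketch for the LLM-orchestrated flow, again assuming the OpenAI Agents SDK; the structural change is that the sales agents and <code>send_email</code> are handed to the Picker as tools, and the host makes a single call.</p>
<pre><code class="lang-python">import asyncio
from agents import Agent, Runner, function_tool  # pip install openai-agents

@function_tool
def send_email(body: str) -&gt; str:
    """Illustrative stand-in for the real email-sending tool."""
    print("Sending:\n", body)
    return "sent"

def make_sales_agent(style: str) -&gt; Agent:
    return Agent(
        name=f"{style} sales agent",
        instructions=f"Write a cold sales email in a {style} tone.",
    )

# The sales agents become tools of the Picker; its LLM now decides which of
# them to call, how to compare the drafts, and when to send.
picker = Agent(
    name="Picker",
    instructions=(
        "Use the sales-agent tools to generate drafts, pick the single best one, "
        "and send it with the send_email tool."
    ),
    tools=[
        make_sales_agent(style).as_tool(
            tool_name=f"{style}_sales_agent",
            tool_description=f"Writes a {style} sales email draft.",
        )
        for style in ("humorous", "serious", "professional")
    ] + [send_email],
)

async def main() -&gt; None:
    # One call; the orchestration happens inside the Picker's reasoning.
    await Runner.run(picker, "Write and send the best cold outreach email for our product.")

asyncio.run(main())
</code></pre>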
<p><strong>Properties</strong></p>
<ul>
<li><p>Orchestration is emergent, decided in the model’s chain-of-thought.</p>
</li>
<li><p>Less host-side boilerplate; more autonomy for the agent.</p>
</li>
<li><p>Harder to reason about, because of the agent’s autonomy.</p>
</li>
</ul>
<hr />
<h1 id="heading-inspecting-openai-traces">Inspecting OpenAI traces</h1>
<h2 id="heading-user-orchestrated">User orchestrated</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759929796138/948524bd-fc11-42ec-b715-fb73f986fbe0.png" alt class="image--center mx-auto" /></p>
<p>In the user-orchestrated version, every step of the workflow is explicitly controlled by the user (or a top-level script). Each sales agent is called independently, their results collected by the user, and finally, the user invokes the picker agent. You can see in the trace that all three sales agents complete their calls separately — the LLMs do not interact with each other — and the picker agent is invoked only after the user collects their outputs.<br />This pattern offers <strong>predictability and transparency</strong>, making it easier to debug and monitor, but it also requires the user (or a coordinating system) to handle sequencing and logic manually.</p>
<h2 id="heading-llm-orchestrated">LLM orchestrated</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759929530211/13fc0fc2-1630-4886-8cd0-0d74373b3f37.png" alt class="image--center mx-auto" /></p>
<p>Here, the <strong>Picker Agent</strong> takes over as the orchestrator. Notice how it calls each specialized Sales Agent as a <strong>tool</strong> — visible in the nested traces — and then makes a final call to <code>send_email</code>. The Picker Agent’s LLM decides which sub-agents to invoke, when to stop, and how to rank the results, all without explicit user direction.<br />This pattern demonstrates <strong>Agent-as-Tool</strong> composition: agents invoking other agents as tools. It’s more autonomous and adaptable, allowing the LLM to coordinate reasoning across multiple roles, though at the cost of more complexity and potential latency (as shown by the longer overall runtime).</p>
<hr />
<h1 id="heading-when-is-one-better-than-the-other">When is one better than the other?</h1>
<h2 id="heading-example-1-financial-compliance-reporting">Example 1 : Financial compliance reporting</h2>
<p><strong>Better choice:</strong> User orchestrated flow</p>
<p><strong>Why:</strong> legal/regulatory contexts require deterministic control, stepwise validation, and auditable intermediate outputs. Let humans define exactly what happens, in which order, and when approvals are required. An LLM improvising or skipping checks is unacceptable.</p>
<p><strong>Pattern:</strong> User orchestrates; the LLMs are specialized tools that cannot change the pipeline.</p>
<p><strong>Key priorities:</strong> audit logs, checkpoints for human review, deterministic ordering, strict input validation.</p>
<hr />
<h2 id="heading-example-2-customer-technical-troubleshooting">Example 2 : Customer technical troubleshooting</h2>
<p><strong>Better choice:</strong> LLM orchestrated flow</p>
<p><strong>Why:</strong> troubleshooting is branching, context-dependent, and hard to hardcode. The LLM can decide whether to analyze logs, query docs, run a diagnostic, or escalate — in real time. This reduces brittle flow charts and yields a more natural customer experience.</p>
<p><strong>Pattern:</strong> LLM orchestrates; sub-agents (log parser, KB search, diagnostics) are provided as tools/sub-agents. The LLM composes them adaptively.</p>
<p><strong>Key priorities:</strong> adaptivity, concise context passing, capable sub-agents, careful rate limits &amp; cost controls.</p>
<hr />
<h1 id="heading-final-thought">Final thought</h1>
<p>When I built the LLM-orchestrated version of my workflow, I expected it to choose the single best email and send it. Instead, it sent all three.</p>
<p>My developer instinct took me down the debugging path for a moment. Then it hit me — the model wasn’t following broken logic; it was following its own reasoning. It had concluded, in its own way, that sending all options might be a better outcome. The more autonomy we give LLMs, the more their behavior reflects <strong>interpretation rather than execution</strong>.</p>
<p>Traditional software runs <strong>instructions</strong>. LLMs interpret <strong>intent</strong>. And intent can be ambiguous. You can refine prompts, add guardrails, and constrain tools — and still get different outcomes on different runs. That’s not always a flaw; it’s the nature of probabilistic reasoning.</p>
<p>In a <strong>user-orchestrated flow</strong>, control lives outside the model. The human decides which agents to call and when — ensuring predictability at the cost of flexibility.<br />In an <strong>LLM-orchestrated flow</strong>, control moves inside the model. The agent itself decides how to sequence steps, which tools to invoke, and when to stop. It can adapt, but it can also surprise you.</p>
<p>This unpredictability raises a question that every AI engineer eventually faces:</p>
<blockquote>
<p><strong>How much control are we willing to hand over to systems that can reason?</strong></p>
</blockquote>
<p>As developers, we’re used to determinism — same input, same output. But LLMs don’t think in binaries. They weigh, interpret, and improvise. Their behavior isn’t guaranteed; it’s guided. And as we start using them as orchestrators, we’re not just designing workflows — we’re designing <strong>boundaries for reasoning</strong>.</p>
<p>The LLM sending three emails instead of one wasn’t a failure of orchestration. It was a reminder that <strong>autonomy and predictability live in tension</strong>. Our job, as builders of agentic systems, is not to eliminate that tension — but to <strong>engineer responsibly within it</strong>. Carefully choose the pattern that best serves your use case. And in some cases, you might even combine both — creating <em>controlled autonomy</em> that lets creativity flow within the boundaries you define.</p>
<p>So — <strong>would you let the LLM take the wheel?</strong></p>
<hr />
<h1 id="heading-for-further-reading-and-learning">For further reading and learning</h1>
<p><a target="_blank" href="https://www.udemy.com/share/10dasB3@IeywCn01iM9tAyaEfMBTez1og-Zv2Ix_B0kOm1YrMbGZo2olASJITAzSTY0evqhi/">Learn AI agents with Ed Donner</a></p>
<p><a target="_blank" href="https://www.anthropic.com/engineering/building-effective-agents">Anthropic’s blog on effective agents</a></p>
<p><a target="_blank" href="https://www.youtube.com/watch?v=LP5OCa20Zpg&amp;t=467s">Building AI Agents</a> with Anthropic’s Barry Zhang , Erik Schluntz (YouTube video)</p>
]]></content:encoded></item><item><title><![CDATA[Graph RAG: Beyond Vector Search in Retrieval-Augmented Generation]]></title><description><![CDATA[Retrieval-Augmented Generation (RAG) has quickly become one of the most powerful techniques for grounding large language models (LLMs). Instead of expecting the model to “know everything,” we feed it relevant external knowledge retrieved at query tim...]]></description><link>https://blog.akashja.in/graph-rag-beyond-vector-search-in-retrieval-augmented-generation</link><guid isPermaLink="true">https://blog.akashja.in/graph-rag-beyond-vector-search-in-retrieval-augmented-generation</guid><category><![CDATA[Retrieval-Augmented Generation]]></category><category><![CDATA[Neo4j]]></category><category><![CDATA[Milvus]]></category><category><![CDATA[Hybrid RAG Systems]]></category><category><![CDATA[llm]]></category><category><![CDATA[context engineering]]></category><dc:creator><![CDATA[Akash Jain]]></dc:creator><pubDate>Mon, 29 Sep 2025 06:32:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759127434346/2d924d35-4341-4d62-997e-db3193c661d3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Retrieval-Augmented Generation (RAG) has quickly become one of the most powerful techniques for grounding large language models (LLMs). Instead of expecting the model to “know everything,” we feed it relevant external knowledge retrieved at query time.</p>
<p>Most implementations use <strong>Vector RAG</strong>: documents are chunked, embedded, stored in a vector database, and retrieved by similarity. This works well for many questions, especially factoid queries.</p>
<p>But what happens when your question is about <strong>relationships</strong> instead of just keywords or themes?</p>
<p>That’s where <strong>Graph RAG</strong> comes in. And when you combine the two, you get the best of both worlds.</p>
<hr />
<h2 id="heading-about-my-application">About my Application</h2>
<p>To explore this space, I built a small application that lets me ask natural language questions about <strong>movies, actors, directors, genres, countries, and languages</strong>, and then compare how Vector RAG and Graph RAG answer them.</p>
<ul>
<li><p><strong>Frontend</strong>: A simple React app with a query box.</p>
</li>
<li><p><strong>Backend components</strong>:</p>
<ul>
<li><p><strong>Neo4j</strong> → Knowledge graph of movies, actors, directors, genres, languages, and countries.</p>
</li>
<li><p><strong>Milvus</strong> → Vector database for semantic similarity search over unstructured movie descriptions and metadata.</p>
</li>
<li><p><strong>Ollama</strong> → Local LLM that turns query results into natural-language answers.</p>
</li>
</ul>
</li>
</ul>
<p>The app shows juxtaposed responses from <strong>Graph RAG and Vector RAG</strong>, including execution time, confidence, and retrieved context. This setup makes it easy to see strengths, weaknesses, and why a <strong>hybrid RAG system</strong> often makes the most sense.</p>
<hr />
<h2 id="heading-how-the-flows-differ">How the Flows Differ</h2>
<p>Here’s what happens under the hood when you ask a question.</p>
<h3 id="heading-graph-rag-flow">🔹 Graph RAG Flow</h3>
<pre><code class="lang-markdown">User Query 
   ↓
LLM Call 1 → Extract entities &amp; relations
   ↓
LLM Call 2 → Generate Cypher query
   ↓
Neo4j (Graph Query Execution)
   ↓
LLM Call 3 → Synthesize final response
   ↓
Answer
</code></pre>
<h3 id="heading-vector-rag-flow">🔹 Vector RAG Flow</h3>
<pre><code class="lang-markdown">User Query
   ↓
Generate Query Embedding
   ↓
Vector Database (Milvus) Search
   ↓
LLM Call → Synthesize final response
   ↓
Answer
</code></pre>
<p>👉 Graph RAG requires <strong>multiple LLM calls</strong> but produces highly precise, explainable results.<br />👉 Vector RAG is <strong>faster and fuzzier (relatively speaking)</strong>, great for semantic overlap but weaker at multi-hop reasoning.</p>
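<p>Stripped down to just the retrieval step, the two paths look roughly like this in Python. This is a sketch, not the app's actual code: the connection details are illustrative, the Cypher assumes my graph schema, and <code>embed()</code> stands in for whichever embedding model produces the query vector.</p>
<pre><code class="lang-python">from neo4j import GraphDatabase    # pip install neo4j
from pymilvus import MilvusClient  # pip install pymilvus

# --- Graph retrieval: explicit relationships, exact matches ---
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
cypher = """
MATCH (a:Actor {name: $a})-[:ACTED_IN]-&gt;(m:Movie)&lt;-[:ACTED_IN]-(b:Actor {name: $b})
RETURN m.title AS title
"""
with driver.session() as session:
    graph_hits = [r["title"] for r in session.run(cypher, a="Amy Irving", b="Matthew McConaughey")]

# --- Vector retrieval: semantic similarity over movie descriptions ---
milvus = MilvusClient(uri="http://localhost:19530")
query_embedding = embed("Movies with Amy Irving and Matthew McConaughey")  # hypothetical helper
vector_hits = milvus.search(
    collection_name="movie_descriptions",
    data=[query_embedding],
    limit=5,
    output_fields=["title"],
)
</code></pre>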
<hr />
<h2 id="heading-real-examples">Real Examples</h2>
<h3 id="heading-complex-queries-graph-vs-vector-rag">Complex Queries: Graph vs Vector RAG</h3>
<p>Query: <em>“Movies where Amy Irving and Matthew McConaughey have acted together in the same movie.”</em></p>
<ul>
<li><p><strong>Graph RAG</strong> → 🎬 <em>Thirteen Conversations About One Thing (2001)</em> — correct.</p>
</li>
<li><p><strong>Vector RAG</strong> → 🎬 <em>U-571</em> and 🎬 <em>Contact</em> — both McConaughey films, but no Amy Irving.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759126014524/71743bba-7759-45a7-af28-c11f4954da0d.png" alt class="image--center mx-auto" /></p>
<p><strong>Takeaway</strong>: Vector retrieval is about <em>semantic similarity</em>. Graph retrieval is about <em>explicit relationships</em>.</p>
<hr />
<h3 id="heading-simple-queries-graph-vs-vector-rag">Simple Queries: Graph vs Vector RAG</h3>
<p>Query: <em>“Who directed Kabhi Alvida Naa Kehna?”</em></p>
<p>Both approaches return: <strong>Karan Johar</strong> ✅</p>
<ul>
<li><p><strong>Graph RAG</strong>: Precise, but involves multiple hops (entities → Cypher → execution → synthesis).</p>
</li>
<li><p><strong>Vector RAG</strong>: Cheaper, faster, and good enough for factoid queries.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759126021035/1642c732-e8f0-4f8e-86e7-7208a7585634.png" alt class="image--center mx-auto" /></p>
<p><strong>Takeaway</strong>:</p>
<ul>
<li><p>For <strong>simple factoid queries</strong>, Vector RAG can match Graph RAG’s accuracy at a lower cost and latency.</p>
</li>
<li><p>For <strong>complex relational queries</strong>, Graph RAG shines.</p>
</li>
</ul>
<hr />
<h2 id="heading-where-graph-rag-shines">Where Graph RAG Shines</h2>
<ul>
<li><p><strong>Structured reasoning</strong>: Multi-hop queries like <em>“actors who worked with Christopher Nolan and Steven Spielberg.”</em></p>
</li>
<li><p><strong>High precision</strong>: Prevents hallucinations by relying on explicit graph edges.</p>
</li>
<li><p><strong>Complex domains</strong>: Movies, medicine, finance, or supply chains where relationships matter.</p>
</li>
</ul>
<hr />
<h2 id="heading-where-graph-rag-fails">Where Graph RAG Fails</h2>
<ul>
<li><p><strong>Fuzziness</strong>: Queries like <em>“movies about space exploration”</em> may miss results if not encoded explicitly in the graph.</p>
</li>
<li><p><strong>Cold start</strong>: If the graph doesn’t contain the fact, nothing is retrieved.</p>
</li>
<li><p><strong>High upfront effort</strong>: Building and maintaining the graph schema and ingestion pipeline takes work. In my application, I had to design a graph schema first and then load it into Neo4j. With vectors, I simply uploaded the text to Milvus, chunked by movie, with far less effort.</p>
</li>
</ul>
<hr />
<h2 id="heading-cold-start-in-graph-vs-vector-rag">Cold Start in Graph vs Vector RAG</h2>
<p>The <strong>cold start problem</strong> affects both approaches:</p>
<ul>
<li><p><strong>Graph RAG</strong>: If the node/relationship isn’t ingested, queries return nothing.</p>
</li>
<li><p><strong>Vector RAG</strong>: If a document isn’t embedded and indexed, similarity search won’t find it.</p>
</li>
</ul>
<p>The difference is in the end-<em>user experience</em>:</p>
<ul>
<li><p>Graph’s missing data is immediately <strong>obvious</strong>.</p>
</li>
<li><p>Vector’s missing data feels less harsh because embeddings “cover” broader text.</p>
</li>
</ul>
<p>This again makes the case for <strong>hybrid RAG</strong>: vectors for coverage, graphs for precision.</p>
<hr />
<h2 id="heading-latency-graph-vs-vector-rag">Latency: Graph vs Vector RAG</h2>
<ul>
<li><p><strong>Graph RAG</strong>: Multiple LLM calls → more accurate, but slower.</p>
</li>
<li><p><strong>Vector RAG</strong>: Single LLM call after retrieval → faster, especially for factoid queries.</p>
</li>
</ul>
<p>Latency grows with query complexity:</p>
<ul>
<li><p>Graph adds reasoning steps.</p>
</li>
<li><p>Vector mostly waits on the LLM’s answer.</p>
</li>
</ul>
<hr />
<h2 id="heading-hybrid-rag-best-of-both-worlds">Hybrid RAG: Best of Both Worlds</h2>
<p>A good RAG application doesn’t pick one — it orchestrates both.</p>
<h3 id="heading-smart-routing-flow">Smart Routing Flow</h3>
<ol>
<li><p><strong>LLM Classifies Query</strong></p>
<ul>
<li><p><em>Factoid/simple</em> → Vector RAG.</p>
</li>
<li><p><em>Relational/multi-hop</em> → Graph RAG.</p>
</li>
</ul>
</li>
<li><p><strong>Dynamic Invocation</strong> of the right retriever.</p>
</li>
<li><p><strong>Fallback Flow</strong>: If uncertain, try Vector first (fast, fuzzy), then Graph for relational verification.</p>
</li>
</ol>
<pre><code class="lang-markdown">User Query
   ↓
LLM Call → Classify Query Type
   ↓
 ┌─────────────┬─────────────┐
 │ Factoid     │ Relational  │
 │ → Vector RAG│ → Graph RAG │
 └─────────────┴─────────────┘
   ↓
If uncertain → Fallback: Vector → Graph
   ↓
Final Answer
</code></pre>
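<p>A minimal sketch of that router in Python, assuming Ollama for the classification call; <code>graph_rag()</code> and <code>vector_rag()</code> are placeholders for the two retrieval pipelines described above.</p>
<pre><code class="lang-python">import ollama  # pip install ollama

ROUTER_PROMPT = (
    "Classify the user question as FACTOID or RELATIONAL. "
    "Answer with one word only.\n\nQuestion: {q}"
)

def answer(question: str) -&gt; str:
    # 1. LLM classifies the query type.
    reply = ollama.chat(
        model="llama3",  # illustrative model name
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(q=question)}],
    )
    query_type = reply["message"]["content"].strip().upper()

    # 2. Dynamically invoke the right retriever.
    if query_type == "RELATIONAL":
        return graph_rag(question)   # placeholder: Graph RAG pipeline
    if query_type == "FACTOID":
        return vector_rag(question)  # placeholder: Vector RAG pipeline

    # 3. Fallback: fast, fuzzy vector pass first, then graph for verification.
    return graph_rag(question, candidates=vector_rag(question))
</code></pre>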
<hr />
<h2 id="heading-case-study-the-bahubali-query">Case Study: The “Bahubali” Query</h2>
<p>Query: <em>“Which other movies has the director of Bahubali directed?”</em></p>
<ul>
<li><p><strong>Graph RAG alone</strong> → Fails, because the graph database only contains the titles <em>“Bahubali: The Beginning”</em> and <em>“Bahubali 2: The Conclusion”</em>, and the lookup requires an exact match (<em>“Bahubali: The Beginning”</em>).</p>
</li>
<li><p><strong>Vector RAG alone</strong> → Finds the right movie via similarity, but weak on relational precision.</p>
</li>
</ul>
<p><strong>Hybrid solution</strong>:</p>
<ol>
<li><p><strong>Vector Retrieval</strong> → Finds candidate titles (<em>Bahubali: The Beginning</em>, <em>Bahubali 2: The Conclusion</em>).</p>
</li>
<li><p><strong>Query Augmentation</strong> → Rewrite: <em>“Which other movies has the director of Bahubali: The Beginning directed?”</em></p>
</li>
<li><p><strong>Graph Query</strong> → Fetches the precise <em>Director</em> node.</p>
</li>
<li><p><strong>Response Synthesis</strong> → <em>“Bahubali: The Beginning was directed by S. S. Rajamouli.”</em></p>
</li>
</ol>
<p>✅ Vector fixes fuzziness. Graph ensures correctness.</p>
<hr />
<h2 id="heading-side-by-side-comparison">Side-by-Side Comparison</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Graph RAG</td><td>Vector RAG</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Cold start</strong></td><td>Needs explicit nodes/relationships</td><td>Needs embedding; no index = no result</td></tr>
<tr>
<td><strong>Fuzziness</strong></td><td>Weak (exact matches)</td><td>Strong (semantic similarity)</td></tr>
<tr>
<td><strong>Schema effort</strong></td><td>High (domain modeling)</td><td>Low (just embed text)</td></tr>
<tr>
<td><strong>Explainability</strong></td><td>Very high (explicit edges)</td><td>Lower (similarity scores)</td></tr>
<tr>
<td><strong>Complex reasoning</strong></td><td>Strong (multi-hop queries)</td><td>Weak</td></tr>
<tr>
<td><strong>Scalability</strong></td><td>Can be expensive on dense graphs</td><td>Scales with embeddings</td></tr>
<tr>
<td><strong>Coverage</strong></td><td>Limited to modeled facts</td><td>Broad, unstructured text</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>Graph RAG is <strong>not a replacement</strong> for vector search — it’s a <strong>complement</strong>.</p>
<ul>
<li><p>When <strong>relationships matter</strong> → Graph RAG wins.</p>
</li>
<li><p>When <strong>themes or fuzzy matching matter</strong> → Vector RAG wins.</p>
</li>
<li><p>When you want <strong>both coverage and precision</strong> → Hybrid RAG is the future.</p>
</li>
</ul>
<p>Ultimately, the exact flow depends on your <strong>context and goals</strong>:</p>
<ul>
<li><p>Do you need the <strong>highest accuracy</strong>, even if it’s slower?</p>
</li>
<li><p>Or do you need <strong>fast, fuzzy answers</strong> with acceptable trade-offs?</p>
</li>
</ul>
<p>A well-designed hybrid RAG system lets you optimize for both — while keeping users in the loop with transparency about what’s happening behind the scenes.</p>
]]></content:encoded></item><item><title><![CDATA[From ActiveMQ to Kafka: My Journey to Understanding Why We Still Need Queues (KIP-932)]]></title><description><![CDATA[Nine years ago, my amazing former boss and mentor Ajith Kumar, asked me to do an impact analysis on replacing ActiveMQ with Kafka in our integration layer. It was part of a broader product overhaul. We took the plunge and adopted Kafka across the sta...]]></description><link>https://blog.akashja.in/from-activemq-to-kafka-my-journey-to-understanding-why-we-still-need-queues-kip-932</link><guid isPermaLink="true">https://blog.akashja.in/from-activemq-to-kafka-my-journey-to-understanding-why-we-still-need-queues-kip-932</guid><category><![CDATA[confluent]]></category><category><![CDATA[kafka]]></category><category><![CDATA[Confluent Kafka]]></category><dc:creator><![CDATA[Akash Jain]]></dc:creator><pubDate>Mon, 30 Jun 2025 11:46:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751283875925/126a3115-c3f3-4e9f-8a86-e34bb5a0a69d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Nine years ago, my amazing former boss and mentor <a target="_blank" href="https://www.linkedin.com/in/ajithkumar54/">Ajith Kumar</a>, asked me to do an impact analysis on replacing ActiveMQ with Kafka in our integration layer. It was part of a broader product overhaul. We took the plunge and adopted Kafka across the stack, replacing ActiveMQ and IBM Infosphere Streams.</p>
<p>I was thrilled.</p>
<p>Suddenly, no more duplicating messages for multiple consumers. No more clunky back-pressure issues. Fast, efficient fan-out. Cleanly decoupled microservices. Kafka felt like a breath of fresh air — a single, elegant solution to many of our messaging woes.</p>
<p>For a long time afterwards, I’d sip my coffee and smugly think, <em>“Why do people even talk about queues anymore?”</em> Kafka could send messages, handle multiple consumers, support retries... What more could you need?</p>
<p>As it turns out, <em>quite a bit</em>. And not in a bad way — just in a nuanced way. Let’s unpack it - without getting too technical about it.</p>
<hr />
<h2 id="heading-kafka-the-mighty-swiss-army-knife-missing-a-few-attachments"><strong>Kafka: The Mighty Swiss Army Knife (Missing a Few Attachments)</strong></h2>
<p>Kafka is incredibly powerful. It moves data at scale, connects systems in real time, and keeps microservices humming along. It’s the de facto tool for anything to do with real-time data and modern event-driven architectures.</p>
<p>But even the best Swiss Army knife isn’t the ideal tool for every job. If you’re opening a wine bottle, a dedicated corkscrew still wins — cleaner, simpler, less likely to injure your hand.</p>
<p>The same is true for messaging models.</p>
<hr />
<h2 id="heading-two-models-two-mindsets"><strong>Two Models, Two Mindsets</strong></h2>
<h3 id="heading-1-publish-subscribe-pub-sub-the-loudspeaker">1. <strong>Publish-Subscribe (Pub-Sub): The Loudspeaker</strong></h3>
<p>Imagine you're at a party and someone yells, “FREE PIZZA IN THE KITCHEN!” Everyone who wants pizza hears it. Those who aren’t interested - or are already three beers deep 🍻🍻 - move on.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p><strong>One-to-Many:</strong> Say it once, deliver to all interested consumers.</p>
</li>
<li><p><strong>Decoupling:</strong> The speaker doesn’t care who hears it. Listeners come and go.</p>
</li>
<li><p><strong>Scalability:</strong> Producers and consumers scale independently.</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p><strong>Task Ownership?</strong> If someone yells “CLEAN THE KITCHEN,” who actually does it?</p>
</li>
<li><p><strong>No Built-in Progress Tracking:</strong> There’s no concept of one message = one worker = one result.</p>
</li>
</ul>
<h3 id="heading-2-queues-the-to-do-list">2. <strong>Queues: The To-Do List</strong></h3>
<p>Now imagine a whiteboard with "Clean Kitchen" written on it. The first person to see it claims it, does the job, and crosses it off.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p><strong>Clear Ownership:</strong> One message, one worker.</p>
</li>
<li><p><strong>Workflow Friendly:</strong> Easy to track state — pending, in progress, done.</p>
</li>
<li><p><strong>Built-in Retries:</strong> If someone can’t finish a task, it goes back in the queue.</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p><strong>No Broadcasts:</strong> Not ideal for notifying multiple parties at once.</p>
</li>
<li><p><strong>Potential Bottlenecks:</strong> Tasks can pile up if processing slows down.</p>
</li>
</ul>
<hr />
<h2 id="heading-an-epiphany-in-the-queue">An epiphany in the Queue</h2>
<p>This was the moment when I truly started appreciating the difference between pub-sub and queue semantics.</p>
<p>Last year, I was helping a telco client build an observability and monitoring layer around their Confluent Kafka setup. Out of curiosity, I asked how they used Kafka (in a professional and monitoring way 🤓).</p>
<p>They walked me through their order processing pipeline. Orders flowed through several stages: <strong>Validation → Provisioning → Billing → Activation</strong>, and so on. Each stage had a corresponding Kafka topic.</p>
<p>If an order passed validation, it got published to the Provisioning topic. Critically, an order only moves to the next stage if everything checks out in the current stage. Simple enough.</p>
<p>But if validation failed due to a transient issue (say, a temporary credit check error), they’d <strong>re-publish</strong> the same message back to the Validation topic to be retried later.</p>
<p>Me: <em>“How long do you retry?”</em><br />Them: <em>“We increment a counter as a part of the Kafka message. If it crosses a threshold, we stop and send it to a dead-letter topic.”</em></p>
<p>Here’s the rub: <strong>Kafka pub-sub was doing queue work, and doing it poorly</strong>. There was no natural way to handle retries, no ownership semantics, no stateful tracking. Just a lot of manual handling and topic juggling.</p>
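<p>To make that concrete, here is a minimal sketch of the kind of workaround they described, written with the confluent-kafka Python client. The topic names, the <code>retries</code> header, and the threshold are all made up for illustration; the point is that the consumer has to do every bit of the bookkeeping itself.</p>
<pre><code class="lang-python"># Hypothetical sketch of retrying via re-publishing. Topic names, the header
# and the threshold are illustrative, not the client's actual setup.
from confluent_kafka import Consumer, Producer

MAX_RETRIES = 5

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "validation-workers",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.validation"])


def validate(order_bytes):
    """Placeholder for the real checks; assume it raises on a transient failure."""
    ...


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    headers = dict(msg.headers() or [])
    retries = int(headers.get("retries", b"0"))
    try:
        validate(msg.value())
        producer.produce("orders.provisioning", msg.value())        # next stage
    except Exception:
        if retries + 1 > MAX_RETRIES:
            producer.produce("orders.validation.dlq", msg.value())  # give up
        else:
            # The "queue" part, done by hand: push it back with a bumped counter.
            producer.produce("orders.validation", msg.value(),
                             headers={"retries": str(retries + 1).encode()})
    producer.flush()
</code></pre>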
<p>It worked — but awkwardly. Like opening a bottle of wine with a butter knife.</p>
<hr />
<h2 id="heading-enter-queue-semantics-in-kafka-kip-932"><strong>Enter Queue Semantics in Kafka (KIP-932)</strong></h2>
<p>Now imagine if they could simply say:</p>
<ul>
<li><p><strong>ACCEPT</strong> – “Validation succeeded. Move it along.”</p>
</li>
<li><p><strong>RELEASE</strong> – “Temporary hiccup. Retry later.”</p>
</li>
<li><p><strong>REJECT</strong> – “This order is bad. Send it to manual review.”</p>
</li>
</ul>
<p>With proper queue semantics (thanks to <a target="_blank" href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka">KIP-932</a>), this is now possible in Kafka. No more simulating queues with topics and retries. The system itself can track message states, retries, and dead letters — without the consumer doing cartwheels.</p>
<p>Even better? <strong>You can mix pub-sub and queue semantics on the same topic.</strong> Need fan-out? Sure. Need one-worker-per-message guarantees? Also yes.</p>
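<p>To see what those semantics buy you, here is a toy, in-memory model of the three acknowledgement outcomes. To be clear, this is <em>not</em> the Kafka share-group client API; it is just a sketch of the state machine (delivery counts, redelivery, dead-lettering) that the broker now tracks for you instead of your consumer code.</p>
<pre><code class="lang-python"># Toy in-memory model of queue-style acknowledgements (ACCEPT / RELEASE / REJECT).
# Not the Kafka client API; it only illustrates the state the broker tracks
# for you under share-group semantics.
from collections import deque
from dataclasses import dataclass

MAX_DELIVERIES = 3


@dataclass
class Message:
    payload: str
    delivery_count: int = 0


class ShareQueue:
    def __init__(self):
        self.pending = deque()
        self.dead_letter = []

    def publish(self, payload):
        self.pending.append(Message(payload))

    def next(self):
        """Hand one message to one worker (ownership, not broadcast)."""
        if not self.pending:
            return None
        msg = self.pending.popleft()
        msg.delivery_count += 1
        return msg

    def accept(self, msg):      # "Validation succeeded. Move it along."
        pass                    # nothing left to do; the message is done

    def release(self, msg):     # "Temporary hiccup. Retry later."
        if msg.delivery_count >= MAX_DELIVERIES:
            self.dead_letter.append(msg)
        else:
            self.pending.append(msg)

    def reject(self, msg):      # "This order is bad. Send it to manual review."
        self.dead_letter.append(msg)


# Usage: the consumer only decides the outcome; the "broker" does the bookkeeping.
q = ShareQueue()
q.publish("order-42")
msg = q.next()
q.release(msg)  # transient credit-check error, so it gets redelivered automatically
</code></pre>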
<hr />
<h2 id="heading-the-takeaway-its-not-eitheror-its-both"><strong>The Takeaway: It's Not Either/Or — It’s Both</strong></h2>
<p>My old belief — “Kafka can do everything!” — wasn’t <em>wrong</em>. But it was <em>simplistic</em>. Kafka can simulate queues, yes. But now, with native queue semantics, it doesn’t have to.</p>
<p>Use pub-sub when you want to shout about pizza.</p>
<p>Use queues when someone needs to clean the kitchen.</p>
<p>And now, with modern Kafka, you don’t have to choose between one model or the other — you just have to pick the right semantic for the job.</p>
<p>Versatility and power: that is what KIP-932 brings to Kafka.</p>
]]></content:encoded></item><item><title><![CDATA[Write -> Audit -> Publish (WAP) pattern with Iceberg]]></title><description><![CDATA[Ever heard of "Garbage In, Garbage Out?" Yeah, it's pretty much the golden rule of data. If your data's a mess, your insights will be too. And with the world increasingly relying on AI, this becomes more important than ever. That's why the Write, Aud...]]></description><link>https://blog.akashja.in/write-audit-publish-wap-pattern-with-iceberg</link><guid isPermaLink="true">https://blog.akashja.in/write-audit-publish-wap-pattern-with-iceberg</guid><category><![CDATA[write-audit-publish]]></category><category><![CDATA[apache hive]]></category><category><![CDATA[big data]]></category><category><![CDATA[data-quality]]></category><category><![CDATA[data quality management]]></category><category><![CDATA[apacheiceberg]]></category><dc:creator><![CDATA[Akash Jain]]></dc:creator><pubDate>Sun, 01 Jun 2025 18:18:47 GMT</pubDate><content:encoded><![CDATA[<p>Ever heard of "Garbage In, Garbage Out?" Yeah, it's pretty much the golden rule of data. If your data's a mess, your insights will be too. And with the world increasingly relying on AI, this becomes more important than ever. That's why the <strong>Write, Audit, Publish (WAP) pattern</strong> is a must in my opinion – it's a super simple 3-step dance to make sure your data is always sparkling clean.</p>
<p>But pulling this off with old-school tools like Hive? That's <strong>not a dance; it's a clumsy wrestling match</strong> – expensive, complicated, and a total headache.</p>
<p>In this article, we will cover what WAP is, why WAP in Hive is such a pain, and how Apache Iceberg's awesome <strong>branching feature</strong> turns the whole thing into a breeze. Let's dive in!</p>
<h3 id="heading-so-what-is-wap">So, what is WAP?</h3>
<p>At its core, the WAP pattern, i.e. <strong>W</strong>rite → <strong>A</strong>udit → <strong>P</strong>ublish, is a simple three-step process designed to ensure good data quality. The principle is simple: check before you publish, much like how I checked this blog for errors before publishing. Let’s break it down:</p>
<ul>
<li><p><strong>Write:</strong> First, your ETL job processes new or updated data and writes it to a <strong>temporary, isolated staging area</strong>. Why isolated staging, you ask? So that no one can see this data while you audit it. This is like a draft version of a blog post: you have written it, but because it is not yet validated or audited, you keep it in drafts.</p>
</li>
<li><p><strong>Audit:</strong> Once the data is written to the staging area, you run a series of <strong>rigorous quality checks</strong> against this staged data. This can involve checking for duplicates, ensuring referential integrity, or comparing new data against benchmarks laid down by the business. Only if the data passes <em>all</em> these checks does it proceed.</p>
</li>
<li><p><strong>Publish:</strong> If the audit is successful, the newly validated data is swapped with the current production data. This swap is atomic. Why? So that the transition from old to new data is instantaneous and seamless, ensuring that consumers always see a complete and valid state. If the audit fails, the staged data is simply discarded, leaving the production data untouched and you, the data engineer, go back to fixing the issues.</p>
</li>
</ul>
<p>The beauty of WAP lies in its ability to guarantee consistency, prevent partial data exposure, and provide a clear rollback mechanism.</p>
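<p>If it helps to see the shape of it in code, here is a minimal, engine-agnostic sketch in Python. The three helper functions are hypothetical stand-ins for whatever your stack actually uses to stage, validate, and swap data.</p>
<pre><code class="lang-python"># Engine-agnostic WAP skeleton. write_to_staging, run_audits and atomic_swap
# are hypothetical stand-ins for whatever your stack uses for each step.
def write_to_staging(batch) -> str:
    """Write the new data to an isolated staging location; return a reference to it."""
    ...


def run_audits(staging_ref: str) -> bool:
    """Duplicate checks, referential integrity, business benchmarks, and so on."""
    ...


def atomic_swap(staging_ref: str) -> None:
    """Atomically make the staged data the new production data."""
    ...


def write_audit_publish(batch) -> None:
    staging_ref = write_to_staging(batch)    # Write
    if run_audits(staging_ref):              # Audit
        atomic_swap(staging_ref)             # Publish
    else:
        raise ValueError(f"Audit failed for {staging_ref}; production stays untouched")
</code></pre>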
<h3 id="heading-implementing-wap-with-hive-welcome-to-the-wrestle-mania">Implementing WAP with Hive: Welcome to the ‘wrestle mania’</h3>
<p>Before the rise of modern table formats, the majority of data lakes relied on Apache Hive, which managed data as files in HDFS directories. Implementing WAP here is possible, but not without challenges:</p>
<ul>
<li><p><strong>Manual Directory Gymnastics:</strong> The ‘Publish’ phase requires physically moving data from the staging table to the production table. Think about doing that every day on a high-volume table. Physically moving data is resource-intensive, and it is time-consuming too, so time to insight is delayed. It is not fun!</p>
</li>
<li><p><strong>Lack of True Atomicity:</strong> Alright, so in a perfect world, when you "Publish" your squeaky-clean data, it should be like a magic trick: <em>poof!</em> The old data is gone, and the new data is there, instantly, for everyone, all at once, without a single hiccup. That's what <strong>atomic</strong> means – all or nothing, no in-between states.</p>
<p>  But with Hive, when you make that switch in the publish phase, you move the data <strong>physically</strong>. And during that awkward moment of transition, there is always a window where queries might fail or users see inconsistent data. <code>INSERT OVERWRITE TABLE</code> doesn’t help either because it deletes everything and rewrites, making your table temporarily unavailable or incomplete.</p>
<p>  You know that feeling? When you're trying to pull off a complex manoeuvre, you just want everyone to look away for a second and hope you succeed. That's exactly how it feels. Not exactly the seamless experience we're dreaming of.</p>
</li>
<li><p><strong>Schema Evolution Nightmares:</strong> Have you ever tried to change a table's schema in Hive, like adding a column or changing a data type? Chances are you had to resort to a full table rewrite, which is every bit as scary as it sounds, and not without some serious hair-pulling. Want to see what these "simple" steps would look like? Imagine you're just trying to add a new column to your <code>sales</code> table. It might go something like this (a PySpark sketch of this dance follows the list):</p>
<ol>
<li><p>Read from <code>sales</code> table.</p>
</li>
<li><p>Write <em>all that data</em> (plus your new column) to a temporary table, let's call it <code>sales_new_i_cancelled_my_date_to_add_a_col</code> (because that is the cost you will pay for something as simple as adding a column).</p>
</li>
<li><p>Then, the rename dance:</p>
<ul>
<li><p><code>sales</code> (your old table) → <code>sales_i_dont_know_when_to_delete</code> (because you genuinely won't know when to delete it!)</p>
</li>
<li><p><code>sales_new_i_cancelled_my_date_to_add_a_col</code> → <code>sales</code> (finally, your updated table!)</p>
</li>
</ul>
</li>
</ol>
</li>
</ul>
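<p>For the morbidly curious, here is roughly what that dance looks like when driven from PySpark against a Hive metastore. Treat it as a sketch only: the temporary table name is the one from the list above, and the new column is a made-up example.</p>
<pre><code class="lang-python"># Rough sketch of the Hive "rename dance" for adding a column, driven from
# PySpark against a Hive metastore. Table and column names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-rename-dance")
         .enableHiveSupport().getOrCreate())

# 1. Read everything and rewrite it, plus the new column, into a temporary table.
spark.sql("""
    CREATE TABLE sales_new_i_cancelled_my_date_to_add_a_col AS
    SELECT *, CAST(NULL AS STRING) AS loyalty_tier FROM sales
""")

# 2. The rename dance: keep the old table around, "just in case".
spark.sql("ALTER TABLE sales RENAME TO sales_i_dont_know_when_to_delete")
spark.sql("ALTER TABLE sales_new_i_cancelled_my_date_to_add_a_col RENAME TO sales")
</code></pre>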
<p>        Am I overcomplicating this? Nope, that's just Hive being Hive. It's not me, it's the system! And speaking of nightmares, let's see how a rollback would go down next...</p>
<ul>
<li><p><strong>Rollback &amp; Cleanup Headaches:</strong> Imagine this: You've just pulled off a painful schema evolution, literally rewriting the whole darn table. You're leaning back, happily sipping your coffee, feeling like a data superhero. Then, <strong>BAM!</strong> The call comes in from the business: that shiny new column you just added? <strong>Wrong data type.</strong> Your coffee suddenly tastes like dread. If not planned meticulously, this kind of nightmare scenario can lead to data loss and force you into hours (or days!) of manual recovery.</p>
<p>  So, how do you roll back that <code>sales</code> table to its old, healthy version in Hive? Well, lucky for you, your smartness led you to preserve the old <code>sales</code> table as <code>sales_i_dont_know_when_to_delete</code>. Good call, because if you'd dropped it, you'd be chugging espressos to get through the all-nighters!</p>
<p>  Now, for the actual rollback dance:</p>
<ol>
<li><p>First, you rename your current <code>sales</code> table (the one with the bad data) to something like <code>sales_hail_iceberg_wont_miss_a_date</code> (a little prayer for the future, perhaps?).</p>
</li>
<li><p>Then, you rename your trusty <code>sales_i_dont_know_when_to_delete</code> (your backup) back to <code>sales</code>. Phew!</p>
</li>
</ol>
</li>
</ul>
<p>    But here's the kicker: you won't rest easy until you've figured out how to delete that lingering <code>sales_hail_iceberg_wont_miss_a_date</code> residual data. In all likelihood, you'd secretly keep it, tucked away, until you change your job – just in case you ever need to perform <em>another</em> rollback!</p>
<p>I hope you are not starting to hate Hive by now. But you can see how these challenges make Hive-based WAP implementations fragile, operationally intensive, and a constant source of anxiety for data teams.</p>
<h3 id="heading-implementing-wap-with-iceberg-the-dance">Implementing WAP with Iceberg: The dance</h3>
<p>Apache Iceberg, as an open table format, fundamentally changes the game by bringing database-like capabilities (ACID transactions, schema evolution, time travel) to your data lake. Its branching feature takes the WAP pattern from a headache to a delightful, controlled process.</p>
<p>Think of Iceberg branching like Git for your data. Here's how it makes WAP a breeze (a short PySpark sketch follows the list):</p>
<ul>
<li><p><strong>1. Write (on a Branch):</strong> Instead of writing to an arbitrary staging directory, you now write to a <strong>dedicated branch</strong> of your Iceberg table (e.g., <code>my_table@dev_feature</code>, <code>my_table@wap_cycle</code>). This branch is an isolated version of your table's history. Production queries continue to read from <code>my_table@main</code> without any disruption or awareness of the changes happening elsewhere.</p>
</li>
<li><p><strong>2. Audit (on the Branch, with Time Travel):</strong> With the new data now committed to your <code>dev_feature</code> branch, the audit phase becomes incredibly powerful. You can run all your validation queries directly against <code>my_table@dev_feature</code>. Need to compare it to yesterday's production data? No problem! Iceberg's time travel allows you to easily query <code>my_table@main</code> (which points to the current production snapshot) or even <code>my_table@main VERSION AS OF 'yesterday'</code>. This enables precise, in-place comparative auditing.</p>
</li>
<li><p><strong>3. Publish (as an Atomic Merge):</strong> This is where Iceberg's branching truly shines. If your <code>dev_feature</code> branch successfully passes all audits, "publishing" the data simply involves <strong>merging</strong> that branch into your <code>main</code>(production) branch. This merge is an <strong>atomic metadata operation</strong>.</p>
<ul>
<li><p><strong>Atomicity:</strong> The <code>main</code> branch's pointer is instantaneously updated to reflect the state of the <code>dev_feature</code> branch. There's no data copying or directory renames.</p>
</li>
<li><p><strong>Zero Downtime:</strong> Readers on <code>main</code> seamlessly switch to the new, merged data once the merge is complete. Ongoing queries see the snapshot they started with, ensuring complete consistency.</p>
</li>
<li><p><strong>Effortless Rollback:</strong> If a critical bug somehow slips through and is discovered <em>after</em> the merge, you can use Iceberg's time travel to instantly roll back the <code>main</code> branch to a snapshot <em>before</em> the merge. Your data is immediately restored to a known good state.</p>
</li>
<li><p><strong>Parallel Development:</strong> Multiple teams can work on different branches, each conducting their own WAP cycles in isolation. When ready, their branches can be merged into <code>main</code> independently, accelerating development without sacrificing quality.</p>
</li>
</ul>
</li>
</ul>
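<p>Here is what a full branch-based WAP cycle could look like using Iceberg's Spark SQL extensions. Treat it as a sketch: it assumes a catalog named <code>lake</code>, a table <code>lake.db.orders</code>, and a reasonably recent Iceberg runtime, and the landing path, audit query, and branch name are all made up. Procedure availability can vary a little between Iceberg versions.</p>
<pre><code class="lang-python"># Sketch of a branch-based WAP cycle with Iceberg's Spark SQL extensions.
# Assumes a catalog named `lake`, a table `lake.db.orders`, and a recent
# Iceberg runtime; table names, paths and the audit query are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-wap").getOrCreate()

# 1. Write: stage the new load on an isolated branch.
spark.sql("ALTER TABLE lake.db.orders CREATE BRANCH IF NOT EXISTS wap_cycle")
incoming = spark.read.parquet("s3://landing/orders/2025-06-01/")
incoming.createOrReplaceTempView("incoming_orders")
spark.sql("INSERT INTO lake.db.orders.branch_wap_cycle SELECT * FROM incoming_orders")

# 2. Audit: run checks against the branch; production (main) is untouched.
dupes = spark.sql("""
    SELECT order_id FROM lake.db.orders.branch_wap_cycle
    GROUP BY order_id HAVING count(*) > 1
""")
assert dupes.count() == 0, "duplicates found, fix before publishing"

# 3. Publish: fast-forward main to the audited branch (an atomic metadata swap).
spark.sql("CALL lake.system.fast_forward('db.orders', 'main', 'wap_cycle')")

# Oops button: if something slips through after the merge, roll main back to a
# known-good snapshot id from the table's history, for example:
# spark.sql("CALL lake.system.rollback_to_snapshot('db.orders', 1234567890)")
</code></pre>
<p>Notice that the publish step is just a pointer move over metadata, which is exactly why it is atomic and instant.</p>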
<h3 id="heading-conclusion">Conclusion</h3>
<p>So there you have it! We've journeyed through the wild west of Hive's WAP patterns, full of manual gymnastics, atomic glitches, schema nightmares, and those dreaded rollback headaches. It is less about data engineering and more about data wrestling, right?</p>
<p>But now we have Apache Iceberg with its super-powered branching feature, waving its magic wand! By letting you <strong>Write on a Branch</strong>, <strong>Audit with Time Travel</strong>, and <strong>Publish with an Atomic Merge</strong>, Iceberg transforms the Write, Audit, Publish pattern from a complex, risky, and manually intensive chore into a streamlined, automated, and supremely reliable workflow.</p>
<p>It's like finally bringing the best practices of software development – think isolated environments, atomic commits, and that glorious "Ctrl+Z" for your entire dataset – directly into your data world. No more sweating bullets, no more data graveyards, just clean, high-quality data effortlessly powering your business decisions.</p>
]]></content:encoded></item><item><title><![CDATA[Demystifying Lambda and Kappa architectures]]></title><description><![CDATA[Remember when 'big data' just meant patiently waiting for yesterday's sales figures to show up on your desk? Ah, the good old days of batch processing! With the advent of real-time data, data processing architectures also evolved. Our apps are real-t...]]></description><link>https://blog.akashja.in/demystifying-lambda-and-kappa-architectures</link><guid isPermaLink="true">https://blog.akashja.in/demystifying-lambda-and-kappa-architectures</guid><category><![CDATA[lambda architecture]]></category><category><![CDATA[kappa architecture]]></category><category><![CDATA[big data]]></category><category><![CDATA[analytics]]></category><category><![CDATA[kafka]]></category><category><![CDATA[apache-flink]]></category><category><![CDATA[Spark Streaming]]></category><category><![CDATA[#apache-spark]]></category><dc:creator><![CDATA[Akash Jain]]></dc:creator><pubDate>Wed, 21 May 2025 18:02:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747851558484/0687e56b-02de-46a2-912b-c75c13004bcb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Remember when 'big data' just meant patiently waiting for yesterday's sales figures to show up on your desk? Ah, the good old days of <strong>batch processing</strong>! With the advent of real-time data, data processing architectures also evolved. Our apps are real-time, our customers expect instant gratification, and frankly, so do our dashboards!</p>
<p>This shift brought about new <strong>architectural blueprints</strong> for handling vast amounts of information: <strong>Lambda</strong> and <strong>Kappa architectures</strong>.</p>
<p>Back in 2011, <a target="_blank" href="https://www.linkedin.com/in/nathanmarz/?ref=blog.akashja.in"><strong>Nathan Marz</strong></a> (creator of Apache Storm) unveiled the <strong>Lambda architecture</strong>, a two-pronged approach to tackle both the deep historical dives and the urgent real-time needs. Then, in 2014, <a target="_blank" href="https://www.linkedin.com/in/jaykreps/?ref=blog.akashja.in"><strong>Jay Kreps</strong></a> (a co-founder of Confluent) proposed the <strong>Kappa architecture</strong> as an alternative, aiming to solve the very same problem with a different philosophy.</p>
<p>Each offers a different approach to ingesting, storing, processing, and utilizing data, with its own set of strengths and weaknesses. Ready to dive in and get a fun, basic understanding of what makes these powerful paradigms tick?</p>
<hr />
<h2 id="heading-lambda-architecture">Lambda Architecture</h2>
<p>Imagine trying to get the full picture of your business, both looking back at everything that's ever happened and seeing what's happening now. That's the challenge the <strong>Lambda architecture</strong> was built to conquer, and historically, it has been a widely adopted blueprint for big data processing. It's structured around <strong>three distinct layers</strong>:</p>
<ol>
<li><p><strong>The Batch Layer:</strong> Think of this as your organization’s ultimate historical archive and deep analysis engine. Every single piece of data ever collected lands here, untouched and immutable. This is where you run those heavy-duty, comprehensive analytics batch jobs, sifting through mountains of past information to forge perfectly accurate, pre-computed views. As you can imagine, it demands vast storage for historical data, but in return, it unlocks the power to perform incredibly complex analytics across your entire historical dataset.</p>
</li>
<li><p><strong>The Speed Layer (or Stream Processing Layer):</strong> Now, picture data rushing in by the second – customer clicks, sensor readings, social media mentions. The speed layer is the lightning-fast counterpart, gobbling up these real-time streams as they arrive. Its mission? To deliver immediate, low-latency insights and constantly update those 'right now' views, seamlessly filling the gap while the batch layer takes its time with deep dives.</p>
</li>
<li><p><strong>The Serving Layer:</strong> This layer is where the magic truly comes together. It's where the accurate insights from the batch layer are merged with the fresh, real-time pulse from the speed layer. This delivers a unified, up-to-the-minute picture to your applications and dashboards. You can select the perfect combination of data stores for your specific needs—perhaps a data warehouse for comprehensive OLAP queries, or a blazing-fast key-value store like Redis for those critical low-latency applications.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747812372672/b74ba425-f164-414d-aef1-aedf2ca79740.png" alt="Lambda architecuture (trivia - notice the λ in blue) " class="image--center mx-auto" /></p>
</li>
</ol>
<h3 id="heading-benefits-of-lambda-architecture">Benefits of Lambda Architecture</h3>
<p>Despite its complexities, the Lambda architecture offers significant advantages that made it a foundational choice for many big data systems:</p>
<ul>
<li><p><strong>High Accuracy and Completeness:</strong> The batch layer, by processing all historical data, ensures that analytical views are highly accurate and complete, making it ideal for critical reporting and historical analysis where precision is paramount.</p>
</li>
<li><p><strong>Robust Fault Tolerance:</strong> The immutability of the batch layer's data and its ability to reprocess ensures a high degree of fault tolerance. If there are issues in the speed layer, the batch layer can always regenerate accurate views.</p>
</li>
<li><p><strong>Effective Handling of Late-Arriving Data:</strong> Data that arrives out of order or with delays is seamlessly incorporated into the batch layer's processing cycle, ensuring that no information is lost and all data eventually contributes to the final, accurate historical view.</p>
</li>
</ul>
<h3 id="heading-challenges-with-lambda-architecture">Challenges with Lambda Architecture</h3>
<p>However, the Lambda architecture isn't without its challenges. Take data transformation logic, such as validation or standardization, as an example: you would need to implement that logic independently in both the batch and speed layers. This duplication of code and effort leads to <strong>significant operational complexity</strong> and makes <strong>maintenance a considerable challenge</strong>. Furthermore, due to the inherent differences in how data is processed in each layer, achieving <strong>consistent data views</strong> across both batch and real-time outputs can be far from straightforward.</p>
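<p>To make that duplication concrete, here is a tiny PySpark sketch. The paths, Kafka topic, and column names are made up, and the streaming half assumes the Spark-Kafka connector is available; the point is that the same standardisation logic has to be wired into a batch job and a streaming job separately, and keeping the two copies in sync is entirely on you.</p>
<pre><code class="lang-python"># Illustrative only: the same transformation wired into both Lambda layers.
# Paths, the Kafka topic and column names are made up, and the streaming part
# assumes the Spark-Kafka connector is on the classpath.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-duplication").getOrCreate()


def standardise(orders: DataFrame) -> DataFrame:
    """Shared-in-theory logic; in practice each layer tends to grow its own copy."""
    return (orders
            .withColumn("country", F.upper("country"))
            .filter(F.col("amount") > 0))


# Batch layer: recompute accurate views over the full history.
batch_view = standardise(spark.read.parquet("s3://lake/orders/"))
(batch_view.groupBy("country").sum("amount")
    .write.mode("overwrite").parquet("s3://serving/orders_by_country/"))

# Speed layer: the very same logic has to be applied again, this time to the
# live stream, and kept in lock-step with the batch job forever.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS raw"))
# ...parse `raw`, apply standardise() again, and maintain the real-time view.
</code></pre>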
<hr />
<h2 id="heading-kappa-architecture">Kappa architecture</h2>
<p>Lambda architecture may feel a bit like juggling two separate data kitchens. Kappa architecture provides an alternative that takes a more minimalist and elegant approach. The core idea underlying Kappa is to radically simplify data processing by merging those distinct batch and speed layers into one powerful, unified engine.</p>
<p>How does it achieve this? By treating all data as a continuous stream originating from an immutable event log (think of it as a never-ending tape recorder that captures every single event). When you need to process historical data, you don't run a separate batch job; you simply 'rewind' and replay that very same stream through your unified processing engine. This streamlined approach gets rid of the operational headache of maintaining two separate codebases, with processed results consistently landing in a database ready for instant queries.</p>
<p>At its heart, Kappa architecture relies on these core components:</p>
<ul>
<li><p><strong>Immutable Log:</strong> The single source of truth for all data, capturing every event in a durable, ordered, and replayable sequence.</p>
</li>
<li><p><strong>Streaming Processing Layer:</strong> The unified engine that consumes data from the immutable log, performing all necessary transformations and aggregations for both real-time and historical views.</p>
</li>
<li><p><strong>Serving Layer:</strong> A data store that holds the materialised views generated by the streaming processing layer, making them available for immediate queries by applications and dashboards.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747847444837/fc31e031-9438-47a4-838b-bea398b0148f.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-advantages-of-kappa-architecture">Advantages of Kappa Architecture</h3>
<ul>
<li><p><strong>Simplified Architecture &amp; Unified Codebase:</strong> This is Kappa's crowning glory! By having a single processing layer, you write and maintain just one set of logic for all your data transformations, whether for real-time insights or historical analysis. This drastically reduces development complexity and makes your team's life much easier.</p>
</li>
<li><p><strong>Reduced Operational Overhead:</strong> With fewer moving parts and a single codebase, managing, deploying, and monitoring your data pipeline becomes significantly less burdensome.</p>
</li>
<li><p><strong>Easier Maintenance:</strong> Debugging becomes a much simpler affair when you only have one set of code to inspect. Updates and improvements can be rolled out more efficiently.</p>
</li>
<li><p><strong>Consistent Data Views:</strong> Since all data (historical and real-time) flows through the same processing logic, you inherently achieve a high degree of consistency across all your materialised data views.</p>
</li>
<li><p><strong>Optimal for Real-time Analytics:</strong> Designed from the ground up for streaming, Kappa excels at delivering immediate insights, making it perfect for real-time applications.</p>
</li>
</ul>
<p>However, like any powerful tool, Kappa isn't a silver bullet. It comes with its own set of considerations:</p>
<h3 id="heading-challenges-and-considerations-with-kappa-architecture">Challenges and Considerations with Kappa Architecture</h3>
<ul>
<li><p><strong>Reliance on a Robust Streaming Platform:</strong> Kappa architecture relies on an immutable log. To truly shine, you need a high-throughput, highly durable, and scalable streaming platform like Apache Kafka. This core component is non-negotiable and requires careful selection and management.</p>
</li>
<li><p><strong>Potential Challenges with Large-Scale Stream Replay:</strong> While the idea of replaying the stream for historical processing is great, for truly massive, multi-year historical datasets, the initial re-computation (or 'rewinding') can still be a significant and time-consuming operation. Proper planning for how to manage these replays is crucial.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Alright! Before we bring down the curtain, let's summarize! We've just navigated the fascinating, sometimes complex, world of Lambda and Kappa architectures. We dove into the layers of Lambda, understanding its powerful dual approach that delivers both solid historical accuracy and immediate real-time insights—even if it meant a bit of operational juggling. We then met the minimalist Kappa, which promises to simplify everything by treating all data as one continuous, replayable stream from an immutable event log.</p>
<p>So, who wins this architectural showdown? Well, there's no single winner. Lambda is your reliable workhorse for deep historical dives, especially when late-arriving data is common. But Kappa? It's the agile speed demon, cutting through complexity and offering inherent consistency, making it a star for modern, stream-first applications. Your choice truly depends on your specific needs, the complexity you're willing to manage, and your hunger for ultimate data consistency.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p>This YouTube video - <a target="_blank" href="https://www.youtube.com/watch?v=fPlgoTLJh38">Lambda Architecture in 10 minutes or less</a></p>
</li>
<li><p>The original blog where it all started “<a target="_blank" href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/?ref=blog.akashja.in">Questioning the Lambda Architecture</a>” blog by <a target="_blank" href="https://www.linkedin.com/in/jaykreps/">Jay Kreps</a></p>
</li>
<li><p>This fantastic blog <a target="_blank" href="https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/">Kappa Architecture is Mainstream Replacing Lambda</a> by <a target="_blank" href="https://www.linkedin.com/in/kaiwaehner/?ref=blog.akashja.in">Kai Waehner</a>.</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>