OddsRelay

How to build an arbitrage scanner (the data layer)

Building a sure-bet finder is two jobs. The detection maths is the fun part; the data layer underneath is the one most teams end up licensing. Here is why, and what it needs to deliver.

James · 6 min read

An arbitrage scanner is two layers, and they are wildly unequal in cost. The data layer gathers prices from many books plus the exchange lay side, keeps them fresh, and normalises them so the same event lines up across sources. The detection layer reads that data, finds sets of prices whose implied probabilities sum to under 100%, ranks them, alerts, and sizes stakes. The detection layer is the part you can write in a weekend. The data layer is the part that never finishes, and it is the part most teams end up licensing.
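To make the "under 100%" test concrete, here is a minimal Python sketch of the back-only case. It assumes decimal odds and a set of outcomes that is mutually exclusive and exhaustive (for example home, draw, away, each taken at the best available book). The function names and the example prices are illustrative, not part of any feed.

```python
def implied_total(odds):
    """Sum of implied probabilities (1 / decimal odds) over a full set of outcomes."""
    if any(o <= 1.0 for o in odds):
        raise ValueError("decimal odds must be greater than 1.0")
    return sum(1.0 / o for o in odds)


def is_sure_bet(odds):
    """True when the best prices together imply less than 100%."""
    return implied_total(odds) < 1.0


def equal_return_stakes(odds, total_stake):
    """Split total_stake so every outcome pays back the same amount."""
    total = implied_total(odds)
    return [total_stake * (1.0 / o) / total for o in odds]


# Illustrative prices only: best home, draw and away odds across books.
best_prices = [2.10, 3.60, 4.20]
if is_sure_bet(best_prices):
    stakes = equal_return_stakes(best_prices, total_stake=100.0)
```

The maths is a few lines, which is the point of the paragraph above: the hard part is being sure the three prices really are current and really refer to the same event.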

This post is for developers scoping a build. It separates the two layers cleanly, so you spend your effort where the differentiation is and buy the part that is only a burden. For the theory underneath, see how arbitrage data works.

What are the two layers of an arb scanner?

The data layer feeds the detection layer, and everything downstream inherits its quality. Split the responsibilities like this:

Layer | Owns | Difficulty
Data layer | Many books, exchange lay, freshness, normalisation, completeness | The hard, unending part
Detection / product layer | Thresholds, ranking, alerts, staking maths | The part you differentiate on

A sure bet is a relationship between prices, not a single number. To see it, you need the same selection priced by two opposing sources at the same moment, both fresh, both correctly identified as the same selection. If either side is stale, or the two rows do not actually refer to the same outcome, the scanner produces a phantom arb that vanishes the moment someone tries to place it. That failure is a data-layer failure, and no amount of clever detection code fixes it.

What does the detection layer need from the data?

Detection is only as good as four properties of the data underneath it. Get these right and the maths is straightforward; get any one wrong and the scanner lies to your users.

  • Normalisation across books: Arsenal vs Chelsea at one book and Arsenal FC v Chelsea FC at another must resolve to the same event, and their outcomes to the same selection. Without a consistent event-and-market key, you cannot compare prices at all.
  • Freshness: every price needs a recent timestamp. A stale price looks usable and isn't, which is the most expensive kind of wrong.
  • The exchange lay side: many arbs and every matched position need a lay price from a real exchange, with enough liquidity to fill. Back-only data cannot express the opposing side.
  • Completeness: if a selection silently drops when a source hiccups, your scanner shows a one-sided market and misprices the arb. Nothing should go missing without you knowing.

These four are exactly what a maintained feed exists to guarantee, and exactly what a home-built pipeline erodes over time as bookmaker surfaces change. We go deep on this in what makes an arb feed usable.
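As a rough illustration of the normalisation and completeness properties, the sketch below builds a comparison key from two differently formatted event titles and reports which selections are absent from a market. Everything in it is an assumption made for the example: the suffix list, the separator pattern, and the `home`/`draw`/`away` labels are hypothetical. Real matching also needs alias tables, kick-off times and competition identifiers, which string cleanup alone cannot provide.

```python
import re

# Hypothetical list of club suffixes to ignore when comparing names.
_SUFFIXES = {"fc", "afc", "cf"}

# Hypothetical expected selections per market, used for the completeness check.
EXPECTED_SELECTIONS = {"match_odds": {"home", "draw", "away"}}


def team_key(name):
    """Lower-case a team name, strip punctuation, and drop common suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in _SUFFIXES)


def event_key(title):
    """Turn 'Arsenal FC v Chelsea FC' or 'Arsenal vs Chelsea' into one key."""
    parts = re.split(r"\s+(?:vs?\.?|-)\s+", title.strip(), maxsplit=1, flags=re.IGNORECASE)
    if len(parts) != 2:
        raise ValueError(f"cannot split event title: {title!r}")
    home, away = parts
    return f"{team_key(home)}|{team_key(away)}"


def missing_selections(market, seen):
    """Selections a market should carry that this source did not return."""
    return EXPECTED_SELECTIONS[market] - set(seen)
```

With these helpers, `event_key("Arsenal vs Chelsea")` and `event_key("Arsenal FC v Chelsea FC")` produce the same key, and a non-empty result from `missing_selections` is a signal to withhold the market rather than price it one-sided.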

What does one arb-ready row look like?

A usable row carries both opposing prices already paired, so detection reads a relationship rather than reconstructing one. Here is the shape of a single normalised, matched row (illustrative, not live data):

One arb-ready row · illustrative shape
{
  "event": "Arsenal vs Chelsea",
  "market": "match_odds",
  "selection": "Arsenal",
  "back": { "bookmaker": "bet365", "odds": 2.20 },
  "lay":  { "exchange": "betfair", "odds": 2.16, "liquidity": 1420 },
  "rating": 100.9,
  "qualifying_loss": 0.08
  // ... region, feed_type and freshness fields elided
}

The back and lay blocks are the two opposing prices for the same selection. A rating above 100 flags a set whose implied probabilities sum to under 100%: the arb signal itself. The qualifying_loss field tells your staking maths the expected position. A raw multi-book API gives you back prices and leaves the pairing, the exchange side, and the rating for you to build. That pairing is the data layer's real output. The full envelope is in the API docs.
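For a sense of what staking maths does with a paired row, here is a sketch of the standard equal-outcome lay calculation applied to the back and lay odds in the sample. It is not the feed's own formula: how `rating` and `qualifying_loss` are defined is a matter for the API docs, and this example does not try to reproduce those figures. The 2% commission rate and the 10-unit back stake are assumptions chosen for illustration.

```python
def lay_stake(back_stake, back_odds, lay_odds, commission=0.0):
    """Lay stake that makes the result the same whichever side wins."""
    return back_stake * back_odds / (lay_odds - commission)


def position(back_stake, back_odds, lay_odds, commission=0.0):
    """Return (lay stake, liability, result if back wins, result if lay wins)."""
    lay = lay_stake(back_stake, back_odds, lay_odds, commission)
    liability = lay * (lay_odds - 1.0)
    if_back_wins = back_stake * (back_odds - 1.0) - liability
    if_lay_wins = lay * (1.0 - commission) - back_stake
    return lay, liability, if_back_wins, if_lay_wins


# Odds taken from the illustrative row; stake and commission are assumptions.
lay, liability, if_back_wins, if_lay_wins = position(
    back_stake=10.0, back_odds=2.20, lay_odds=2.16, commission=0.02
)
```

The two results come out equal by construction, and the liability is the figure to compare against the row's `liquidity` before treating the position as placeable.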

Where do you actually differentiate?

You differentiate in the product layer, not the plumbing. Two scanners reading the same clean data compete on how they present and act on it:

  • Thresholds and filters: minimum rating, sport, market, book selection, minimum liquidity, stake-size limits.
  • Ranking: sorting live opportunities by value, freshness, or how long they are likely to last.
  • Alerts: push, email or webhook the moment a set crosses a user's threshold, deduplicated so users are not spammed.
  • Staking maths: converting a rated set into back and lay stakes, accounting for commission and rounding, so the position is placeable.

This is the part worth your engineering time. It is where product judgement lives and where users feel the difference. None of it requires you to own a collection pipeline for 60+ books.
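The thresholds, ranking and alert-deduplication items from the list above can be sketched in a few dozen lines of Python. The `Row` class below is a hypothetical flattening of the arb-ready row, and the default thresholds and the five-minute cooldown are arbitrary placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical flattened view of one arb-ready row.
    event: str
    market: str
    selection: str
    bookmaker: str
    exchange: str
    rating: float
    liquidity: float


def passes(row, min_rating=100.0, min_liquidity=50.0, books=None):
    """Apply a user's thresholds: rating, liquidity, optional book allow-list."""
    if row.rating < min_rating or row.liquidity < min_liquidity:
        return False
    return books is None or row.bookmaker in books


def rank(rows):
    """Highest rating first; sorted() is stable, so ties keep feed order."""
    return sorted(rows, key=lambda r: r.rating, reverse=True)


class AlertDeduper:
    """Suppress repeat alerts for the same opportunity within a cooldown."""

    def __init__(self, cooldown_s=300.0):
        self.cooldown_s = cooldown_s
        self._last_sent = {}

    def should_alert(self, row, now):
        # `now` is a timestamp in seconds, e.g. from time.time().
        key = (row.event, row.market, row.selection, row.bookmaker, row.exchange)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_sent[key] = now
        return True
```

A scan cycle under these assumptions would be `rank(r for r in rows if passes(r))` followed by `should_alert` on each result. Ranking by expected lifetime or freshness would only change the sort key.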

Should you build the data layer yourself?

For most teams, no. Building the data layer means committing to collecting prices from many bookmakers indefinitely, matching each back price to a live exchange, monitoring freshness, and repairing coverage every time a source changes. bet365 is widely regarded as the hardest book to cover well, and it is the one your users most expect to see. The prices are the easy fraction of the work. Keeping them accurate, complete, normalised and fresh is the rest, and it recurs forever.

Licensing the data layer flips the maths. You get 60+ UK books with bet365 included, each back price already matched against three exchanges (Betfair, Smarkets and Matchbook) for the lay side, normalised so events line up, delivered as one authenticated call returning predictable JSON. Your scanner reads arb-ready rows on day one and you build detection on top. That is the trade-off we lay out in full in buy versus build.

How fresh is the data underneath?

Freshness sets the ceiling on what your scanner can honestly find, so be precise about it. Our posture is pre-match polling on a cycle of roughly a few seconds, which suits pre-match arbitrage well. We do not claim sub-second in-play streaming, because that is not what we ship. Freshness, uptime and latency are published on the coverage dashboard so the number is checkable rather than asserted. If a data source will not show you its reliability, treat any real-time claim with caution.
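On the consuming side, a scanner can enforce its own freshness budget before trusting a pair. The sketch below assumes each side of a row carries a timezone-aware timestamp. The sample row elides its freshness fields, so the parameter names here are hypothetical, and the 10-second budget is an arbitrary example to tune against the published polling cycle.

```python
from datetime import datetime, timedelta, timezone

# Assumed budget; choose it relative to the feed's actual polling cycle.
MAX_AGE = timedelta(seconds=10)


def is_fresh(fetched_at, now, max_age=MAX_AGE):
    """True if the price is no older than max_age and not timestamped in the future."""
    age = now - fetched_at
    return timedelta(0) <= age <= max_age


def pair_is_fresh(back_at, lay_at, now, max_age=MAX_AGE):
    """Both sides must be fresh; one stale side is enough to create a phantom arb."""
    return is_fresh(back_at, now, max_age) and is_fresh(lay_at, now, max_age)


now = datetime.now(timezone.utc)
usable = pair_is_fresh(now - timedelta(seconds=2), now - timedelta(seconds=4), now)
```

Rejecting future timestamps is a deliberate choice in this sketch: a clock-skewed source is treated as unverifiable rather than as extra fresh.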

The short answer

Build the detection and product layer, because that is where your scanner earns its keep. Buy the data layer, because normalising many books plus exchange lay and keeping it all fresh is a burden that never ends. OddsRelay supplies exactly that layer: 60+ UK books, bet365 included, matched against three exchanges, arb-ready. It powers a leading UK matched-betting platform today. The fastest way to see whether the rows are what your scanner needs is a free trial, and you can check what is live right now on the coverage dashboard before you commit.

Arbitrage & value betting

Written by

James

Founder, OddsRelay

James is the founder of OddsRelay — the odds-data feed behind matched betting, arbitrage and odds-comparison products: 60+ UK bookmakers with bet365 included, matched against exchange lay prices and delivered as one clean, documented API. He writes here about how that data layer actually behaves — coverage, matching, freshness and the trade-offs — from the side that builds and runs it. The same feed powers a leading UK matched-betting platform today.

18+ · Data product for licensed operators. Please gamble responsibly.