I’ve been working on Calibration City, a site for prediction market calibration and accuracy analysis. I want the site to be useful for experienced prediction market users as well as for people who have never heard of them before.
Example user questions we aim to answer include:
I'm interested in sports, how good is Manifold at predicting games a week in advance? Do other sites have a better track record?
This PredictIt market is trading at 90¢ but has less than 2000 shares in volume. How often does a market like that end up being wrong?
I’m worried about the accuracy of markets that won’t resolve for a long time. What is the typical accuracy of a market over a year away from resolution?
Calibration City is currently live! We completed the MVP in January 2024 with additional features landing in February and March. We integrate data from Kalshi, Manifold, Metaculus, and Polymarket, with over 130,000 total markets and over 300 visitors in the past month.
There are currently two main visualizations: calibration and accuracy. The calibration page shows a standard calibration plot for each supported platform. The user can choose how markets are sorted into bins along the x-axis (by the market probability at a specific point, or a time-weighted average). They can also apply weighting to each market based on values such as the market volume, length, or number of traders. Users can filter the total set of markets used for analysis based on keyword, category, duration, volume, or other features. Is Polymarket consistently overconfident? Underconfident? What about on long-term markets?
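To make the binning and weighting concrete, here is a minimal sketch of how a weighted calibration plot can be computed. The `(probability, resolved_yes, weight)` tuple representation is purely illustrative and not Calibration City’s actual schema; the probability could be the market price at a chosen point or a time-weighted average, and the weight could be volume, length, or trader count.

```python
def calibration_bins(markets, n_bins=10):
    """Group markets into probability bins and return (bin midpoint,
    weighted observed resolution frequency) points for a calibration plot.

    markets: list of (probability, resolved_yes, weight) tuples.
    """
    # bins[i] collects markets whose probability falls in [i/n_bins, (i+1)/n_bins)
    bins = [[] for _ in range(n_bins)]
    for prob, resolved_yes, weight in markets:
        i = min(int(prob * n_bins), n_bins - 1)  # clamp prob == 1.0 into the top bin
        bins[i].append((resolved_yes, weight))

    points = []
    for i, members in enumerate(bins):
        total = sum(w for _, w in members)
        if total == 0:
            continue  # skip empty bins rather than plotting a gap as zero
        observed = sum(w for yes, w in members if yes) / total
        points.append(((i + 0.5) / n_bins, observed))
    return points
```

A perfectly calibrated platform would produce points lying on the diagonal: markets binned near 90% should resolve YES about 90% of the time.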
The accuracy plot allows users to directly compare different factors’ effects on market accuracy. In addition to the standard filters and binning options, the user can select a factor such as the market date, total trade volume, market length, or number of traders. With this additional axis, users can learn how (or if) those factors actually impact market accuracy. Does higher trade volume really increase accuracy? If so, by how much? What about more recent markets?
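The idea behind the accuracy plot can be sketched as grouping markets by a chosen factor and computing the average Brier score (squared error between the final probability and the outcome) within each group. Again, the tuple layout and bin edges below are illustrative assumptions, not the site’s real data model.

```python
from bisect import bisect_right

def brier_by_factor(markets, edges):
    """Average Brier score for markets bucketed by a numeric factor.

    markets: list of (final_prob, resolved_yes, factor_value) tuples,
             where factor_value might be trade volume, market length, etc.
    edges:   sorted bucket boundaries; len(edges) + 1 buckets are produced.
    Returns one average score per bucket (None for empty buckets).
    Lower Brier scores mean better accuracy.
    """
    groups = [[] for _ in range(len(edges) + 1)]
    for prob, resolved_yes, value in markets:
        outcome = 1.0 if resolved_yes else 0.0
        groups[bisect_right(edges, value)].append((prob - outcome) ** 2)
    return [sum(g) / len(g) if g else None for g in groups]
```

Plotting these per-bucket averages against the factor is what lets a user see whether, say, higher-volume markets really do carry lower error.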
The beginner-friendly introduction page is a Socratic-style dialog introducing the reader to basic concepts of forecasting before introducing the premise of the site. The resources page lists the current capabilities of the site, answers common questions about the data gathering, and lists a few community resources for further reading. A simple list page displays all markets in the sample, useful for locating outliers or trends over similar markets.
Calibration City was awarded $3500 from the Manifold Community Fund, the highest of any project submitted. It was recently mentioned in Nuño Sempere’s forecasting newsletter for June 2024.
My next big goal is to address one of the biggest problems with naive calibration comparison: different platforms predict different things. Some platforms automatically create dozens of markets in the style of “Will X metric be in range Y at time Z?” every day, while other platforms have far fewer markets with longer timespans and more uncertainty. The analysis you currently see on Calibration City can be very useful, but it’s unfair to calculate a single calibration score for each platform and compare them directly.
In order to address this, we need to classify markets into narrow questions, such as “Who will win the 2024 US presidential election?” or “Will a nuclear weapon be detonated in 2024?”. We can find all markets across all platforms that predict the relevant outcome, check the resolution criteria to make sure they’re essentially equivalent, and then compare those with a relative Brier score that rewards markets that were correct earlier. Once we have a corpus of these questions and their constituent markets, we can calculate a score for each platform in each category and fairly compare them.
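One way to reward earlier correctness, sketched below under illustrative assumptions, is to average the Brier score over each market’s whole lifetime (so a market that converged on the right answer sooner accumulates less error), then score each platform’s market relative to the mean for that question so that hard and easy questions remain comparable. The data shapes here are hypothetical, and this is just one of several reasonable scoring rules.

```python
def time_averaged_brier(prob_series, resolved_yes):
    """Brier score averaged over a market's lifetime.

    prob_series: list of (time, probability) samples, assumed evenly
                 spaced over the market's lifetime (an assumption for
                 this sketch; uneven samples would need time weighting).
    """
    outcome = 1.0 if resolved_yes else 0.0
    scores = [(p - outcome) ** 2 for _, p in prob_series]
    return sum(scores) / len(scores)

def relative_scores(question_markets):
    """Score each platform's market on one question relative to the
    question's mean score. Negative = better than average.

    question_markets: {platform: (prob_series, resolved_yes)}
    """
    scores = {
        platform: time_averaged_brier(series, resolved)
        for platform, (series, resolved) in question_markets.items()
    }
    mean = sum(scores.values()) / len(scores)
    return {platform: s - mean for platform, s in scores.items()}
```

Averaging these relative scores across all questions in a category would then give each platform a per-category score that is not distorted by which questions it happened to host.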
I plan to do this classification primarily with GPT-4, starting with smaller samples and building a corpus from there. A fair amount of human effort will still be necessary to identify variations in resolution criteria and other edge cases. Once we have the dataset I can build a scorecard or dashboard that fairly compares each platform in each category, allowing users to definitively answer which market platform is most accurate in each field.
Some of my other planned features for this project include:
Integrate data from more sites, such as PredictIt, Futuur, and Insight Predictions
Get more data from the sites we do monitor, such as market volume from Polymarket
Easily share visualizations with a link or export a summary card for social sharing
Natively support advanced market types such as multiple-choice or numeric/date markets
Generate individual user calibration plots with the same methodology that we use for platforms
Create an easy-to-use cross-platform bot framework for arbitrage or reactive betting
Have a dashboard of live markets with comparisons/discrepancies across platforms
Provide an estimated probability spread for live markets based on similar past markets
The primary use of this funding will be as compensation for my time. In addition, some planned features will incur direct costs:
Classifying over 130,000 markets with GPT-4 in order to find matches
VPN connections for platforms that restrict users based on location
Additional compute server capacity for increased load
I’m wasabipesto - you may recognize me from the Manifold discord. You can find my contact information and other projects over at my website.
I have a full-time job but I enjoy working on projects like this in my spare time. I am not typically paid for hobby projects so I work on whatever interests me at the moment. Funding from this grant would compensate me for my time and incentivize me to work on additional features when I would otherwise be unproductive or working on other projects.
Calibration City is fully open-source on GitHub and open to community contribution. You can see the live data used by the site for your own analysis at https://api.calibration.city/
I received retroactive funding for this project from the Manifold Community Fund. I don’t receive any ongoing funding for this project.