Public Use Healthcare Claims
Where do you go if you want high-quality healthcare payer claims data? Well, SynPUF, of course!
What are public use healthcare claims?
The CMS 2008-2010 Data Entrepreneurs' Synthetic Public Use File (SynPUF) provides a realistic set of Medicare claims data that is available in the public domain. CMS.gov lists a few purposes for this data:
- software and ETL pipeline development
- training researchers and analytics practitioners in the complexity of claims data
- support for data mining and advanced analytics activities
Why use this data?
We are currently using this data to build solutions, pipelines, data models, and real-world analytics frameworks. This data looks so realistic that we can actually work with it, massage it, and work through issues that we would encounter with typical claims data inside a payer environment.
In addition, I am using this data to generate synthetic Electronic Medical/Health Records (EMR/EHR) in the HL7 FHIR standard, serialized as XML, via Python code. I will do a later write-up on this particular code process.
Large collection of realistic-looking data
There is a fairly "large" amount of claims data available in this set. It will definitely put your processes to the test, but it won't overwhelm them.
Some highlights:
- ~2.3 million unique beneficiary (member) entries in each year from 2008 to 2010
- ~17 million combined inpatient and outpatient (IP and OP, respectively) medical claims for 2008-2010
- ~111 million prescription drug events (PDE / Rx / pharmacy)
- ~95 million carrier claims
The data is available in very manageable pieces. Each of these subject areas is available in its own file, and each file is broken out into 20 samples that correspond to each other. That way, you can pull Sample 01 of the beneficiary, IP, OP, etc. files and get all the corresponding claims for those members.
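As a minimal sketch of how matching samples line up, the snippet below groups claim rows under the beneficiaries from the same sample. The rows are made up for illustration, and the column names (DESYNPUF_ID, CLM_ID, CLM_PMT_AMT) are assumptions that should be checked against the Codebook. With the real files, you would open the downloaded CSVs instead of the inline strings.

```python
import csv
import io

# Tiny made-up stand-ins for Sample 01 of the beneficiary and inpatient files.
# Column names are assumptions; confirm them in the Codebook.
BENEFICIARY_CSV = """DESYNPUF_ID,BENE_SEX_IDENT_CD
A001,1
A002,2
"""

INPATIENT_CSV = """DESYNPUF_ID,CLM_ID,CLM_PMT_AMT
A001,C100,4000
A001,C101,1500
A002,C102,700
"""


def read_rows(handle):
    """Read a CSV file-like object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def claims_by_member(beneficiaries, claims, id_column="DESYNPUF_ID"):
    """Group claim rows under each beneficiary ID found in the beneficiary sample."""
    grouped = {row[id_column]: [] for row in beneficiaries}
    for claim in claims:
        member_id = claim[id_column]
        if member_id in grouped:  # skip claims whose member is not in this sample
            grouped[member_id].append(claim)
    return grouped


beneficiaries = read_rows(io.StringIO(BENEFICIARY_CSV))
claims = read_rows(io.StringIO(INPATIENT_CSV))
grouped = claims_by_member(beneficiaries, claims)

# Print claim count and total paid amount per member.
for member_id, member_claims in grouped.items():
    total_paid = sum(float(c["CLM_PMT_AMT"]) for c in member_claims)
    print(member_id, len(member_claims), total_paid)
```

Keeping the sample number the same across files is what makes this join complete: a claim from Sample 02 would simply not find its member in the Sample 01 beneficiary file.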
Good documentation
CMS has provided very concise data definitions and dictionaries. I mean, it's good in general, but very good for CMS.gov.
- Data Users Document - This will give you the overview and counts for each of the tables, as well as basic analytical and use descriptions.
- Codebook - Simple, extensive description, valid values, counts, etc. for each field in the tables. You will also find external references for some values.
Caveats
The data is not perfect, and you can read more about interpretation limitations in the Users Doc; many of these limitations revolve around the fact that the data is completely deidentified.
However, there are some specific limitations that I think are worth pointing out:
- Data from 2008-2010 is a bit outdated. The healthcare landscape has changed since then, not least through the implementation of the Affordable Care Act (ACA), which had some wide-ranging effects on behavior and reporting.
- Medical claims (IP/OP) use deprecated ICD-9-CM diagnosis codes. Since Oct 2015, ICD-10 has been required on medical claims data, so you will need to convert these. For that, check out CMS's General Equivalence Mappings (GEMs).
- Analytics should be taken with a grain of salt. This is discussed in detail in the Users Doc, but because this is a public use file, the deidentified nature of the data means not all correlations will be represented the same way as in real-world data.
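For the ICD-9 to ICD-10 conversion mentioned above, here is a rough sketch of loading a GEM-style file into a lookup. It assumes each line holds a source code, a target code, and a five-character flag string separated by whitespace, where the first flag marks an approximate match and the second marks "no mapping". The sample lines are made up to show the layout and are not real mappings, so verify the layout against the documentation shipped with the GEM files.

```python
from collections import defaultdict

# Made-up lines that mimic the assumed GEM layout: source code, target code, flags.
# These are NOT real mappings; load the actual GEM file from CMS instead.
SAMPLE_GEM_LINES = [
    "X001  Y0010  00000",
    "X002  Y0020  10000",
    "X002  Y0021  10000",
    "X003  NoDx   11000",
]


def load_gem(lines):
    """Build {icd9_code: [(icd10_code, is_approximate), ...]} from GEM-style lines."""
    mapping = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        icd9, icd10, flags = parts
        if flags[1] == "1":  # assumed second flag: no mapping exists
            continue
        mapping[icd9].append((icd10, flags[0] == "1"))
    return dict(mapping)


def to_icd10(icd9_code, gem):
    """Return candidate ICD-10 codes for an ICD-9 code (dots stripped), or []."""
    key = icd9_code.replace(".", "").strip().upper()
    return [target for target, _approx in gem.get(key, [])]


gem = load_gem(SAMPLE_GEM_LINES)
print(to_icd10("X002", gem))
```

Note that the lookup returns a list: one ICD-9 code can map to several ICD-10 candidates, so the conversion is not always one-to-one and may need a rule for picking among them.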
Other Resources
Here are some other resources that pair well with the SynPUF data for analytics and research purposes:
- CMS ICD-10 to HCC 2018 crosswalks - Used primarily for Risk Adjustment (RA). These crosswalks allow you to roll up from a diagnosis code to a high-level HCC, and each HCC has associated Risk Adjustment Factor values.
- SSA to FIPS state and county crosswalk - The SynPUF beneficiary file uses SSA county codes, which are not widely used for geospatial analysis. FIPS codes are more widely supported, so converting to them lets you use the data in more software packages.
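A crosswalk like the SSA to FIPS one boils down to a dictionary lookup. The sketch below assumes a crosswalk CSV with two hypothetical columns (ssa_state_county and fips_state_county) and made-up rows; the real file's column names will likely differ. It also assumes the beneficiary file stores the SSA state and county codes in separate fields (SP_STATE_CODE and BENE_COUNTY_CD, to be confirmed in the Codebook).

```python
import csv
import io

# Made-up crosswalk rows; the real file's column names and values will differ.
CROSSWALK_CSV = """ssa_state_county,fips_state_county
99001,98001
99002,98003
"""


def load_crosswalk(handle, ssa_col="ssa_state_county", fips_col="fips_state_county"):
    """Return {ssa_code: fips_code}, both zero-padded to five characters."""
    return {
        row[ssa_col].strip().zfill(5): row[fips_col].strip().zfill(5)
        for row in csv.DictReader(handle)
    }


def ssa_to_fips(state_code, county_code, crosswalk):
    """Combine a 2-digit SSA state and 3-digit SSA county code, then look up FIPS."""
    ssa_key = str(state_code).strip().zfill(2) + str(county_code).strip().zfill(3)
    return crosswalk.get(ssa_key)  # None when the code is not in the crosswalk


crosswalk = load_crosswalk(io.StringIO(CROSSWALK_CSV))
print(ssa_to_fips("99", "1", crosswalk))
```

The zero-padding is deliberate: spreadsheet tools and numeric column types tend to drop leading zeros, which silently breaks code lookups.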
Just for fun - try this use case
OK, so I would also like to put out a challenge for anyone who wants to get into healthcare data. I think a good Data Analyst or Data Scientist should be able to work end to end. So, download the datasets, load them, map them, and then try to answer some business problems:
Business Problems: We are a national health insurance payer.
- What diseases have the biggest impact on cost?
- How much should we expect to pay next year for a female member aged 70-74 with heart failure and osteoporosis?
- What diseases were most common among members who died?
Try it out with any stack you like. I would recommend the matt.guide Docker container with Python and MariaDB (MySQL).
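If you want a nudge on the last question, here is one possible starting point rather than a full answer. It counts chronic-condition flags among beneficiaries who have a death date. The rows are made up, and the column names (BENE_DEATH_DT, SP_CHF, SP_DIABETES, SP_OSTEOPRS) and the 1 = yes / 2 = no coding are assumptions to confirm in the Codebook.

```python
import csv
import io
from collections import Counter

# Made-up beneficiary rows. Column names and the 1 = yes / 2 = no coding are
# assumptions; confirm them in the Codebook before using real data.
BENEFICIARY_CSV = """DESYNPUF_ID,BENE_DEATH_DT,SP_CHF,SP_DIABETES,SP_OSTEOPRS
A001,20090312,1,2,1
A002,,1,1,2
A003,20101105,1,1,2
"""

CONDITION_COLUMNS = ("SP_CHF", "SP_DIABETES", "SP_OSTEOPRS")


def conditions_among_deceased(rows, condition_columns=CONDITION_COLUMNS):
    """Count how many deceased members carry each chronic-condition flag."""
    counts = Counter()
    for row in rows:
        if not row["BENE_DEATH_DT"].strip():
            continue  # no death date recorded, so the member is treated as alive
        for column in condition_columns:
            if row[column].strip() == "1":
                counts[column] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(BENEFICIARY_CSV)))
counts = conditions_among_deceased(rows)

# List conditions from most to least common among deceased members.
for column, n in counts.most_common():
    print(column, n)
```

Raw counts are only a first pass. Comparing them against the same rates among surviving members would say more about which conditions actually stand out.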