My Source Code Vulnerability Dataset: CrossVul
A while back, I had the privilege of leading the creation of CrossVul, a comprehensive source code dataset designed to advance automated vulnerability detection and repair across multiple programming languages. Our paper, “CrossVul: A Cross-Language Vulnerability Dataset with Commit Data,” was published at the ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE).
Why I Created CrossVul
At the time, the datasets available for vulnerability research—like SATE IV—were synthetic or limited in scope. There were no large-scale, real-world datasets suitable for training practical classifiers that work across different programming languages.
I needed a dataset grounded in real software development history, with authentic vulnerable and patched code, along with commit data tracing how vulnerabilities were introduced and fixed. Since none existed, I created CrossVul: a multi-language dataset over 1.4 GB in size, containing real vulnerable and patched source code plus detailed commit metadata.
The Impact of CrossVul
Since its release, CrossVul has enabled researchers and developers to:
- Analyze vulnerability patterns across different programming languages,
- Train and benchmark machine learning models for automated vulnerability detection and repair,
- Understand the lifecycle of vulnerabilities through commit histories.
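To make the training use case concrete, here is a minimal sketch of how vulnerable and patched files could be paired into labeled examples. The directory layout and the `bad_` / `good_` file prefixes are assumptions made for illustration, and `pair_samples` and `labeled_examples` are hypothetical helper names, so check the actual dataset layout before relying on any of them.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed (hypothetical) naming scheme: <CWE>/<language>/bad_<id> and good_<id>.
# Adjust these prefixes to match the real dataset layout.
VULNERABLE_PREFIX = "bad_"
PATCHED_PREFIX = "good_"


def pair_samples(paths):
    """Group file paths into (vulnerable, patched) pairs keyed by directory and id."""
    groups = defaultdict(dict)
    for raw in paths:
        path = PurePosixPath(raw)
        name = path.name
        if name.startswith(VULNERABLE_PREFIX):
            key = (str(path.parent), name[len(VULNERABLE_PREFIX):])
            groups[key]["vulnerable"] = raw
        elif name.startswith(PATCHED_PREFIX):
            key = (str(path.parent), name[len(PATCHED_PREFIX):])
            groups[key]["patched"] = raw
    # Keep only complete pairs so every vulnerable file has a patched counterpart.
    return [
        (group["vulnerable"], group["patched"])
        for _, group in sorted(groups.items())
        if "vulnerable" in group and "patched" in group
    ]


def labeled_examples(pairs):
    """Flatten pairs into (path, label) rows: 1 = vulnerable, 0 = patched."""
    rows = []
    for vulnerable, patched in pairs:
        rows.append((vulnerable, 1))
        rows.append((patched, 0))
    return rows
```

Keeping only complete pairs gives a balanced set of positive and negative examples, which is a common starting point for a binary classifier. Unpaired files are silently dropped in this sketch, and a real pipeline might want to log them instead.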
It’s been rewarding to see the dataset help push forward software security research.
Access CrossVul
You can explore the dataset and read the full paper here:
Looking Ahead
Security challenges continue to grow, and datasets like CrossVul play a vital role in powering new research and better tools. I’m proud to contribute this resource to the community and excited to see its ongoing impact.