Security Professional and Researcher

Georgios Nikitopoulos

My Source Code Vulnerability Dataset: CrossVul


A while back, I had the privilege of leading the creation of CrossVul, a source code dataset designed to advance automated vulnerability detection and repair across multiple programming languages. Our paper, “CrossVul: A Cross-Language Vulnerability Dataset with Commit Data,” was published at the ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE).

Why I Created CrossVul

At the time, the datasets available for vulnerability research, such as SATE IV, were synthetic or limited in scope. There were no large-scale, real-world datasets suitable for training practical classifiers that work across different programming languages.

I needed a dataset grounded in real software development history, with authentic vulnerable and patched code, along with commit data tracing how vulnerabilities were introduced and fixed. Since none existed, I created CrossVul: a multi-language dataset over 1.4 GB in size, containing real vulnerable and patched source code plus detailed commit metadata.
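To give a feel for how vulnerable/patched pairs like these can be consumed, here is a minimal Python sketch. The directory layout, the `bad_`/`good_` filename prefixes, and the names `SamplePair`, `iter_pairs`, and `patch_diff` are assumptions made for illustration, not a description of how the released archive is organized; check the archive and adjust the matching rule before relying on it.

```python
import difflib
from pathlib import Path
from typing import Iterator, List, NamedTuple


class SamplePair(NamedTuple):
    vulnerable: Path  # file contents before the fixing commit
    patched: Path     # the same file after the fixing commit


def iter_pairs(root: Path) -> Iterator[SamplePair]:
    """Yield (vulnerable, patched) file pairs found anywhere under root.

    Assumed (hypothetical) naming: the two halves of a pair sit in the same
    directory as `bad_<id>` and `good_<id>`.
    """
    for bad in sorted(root.rglob("bad_*")):
        if not bad.is_file():
            continue
        good = bad.with_name("good_" + bad.name[len("bad_"):])
        if good.is_file():  # skip samples whose patched half is missing
            yield SamplePair(vulnerable=bad, patched=good)


def patch_diff(pair: SamplePair) -> List[str]:
    """Return a unified diff (list of lines) from vulnerable to patched."""
    before = pair.vulnerable.read_text(errors="replace").splitlines()
    after = pair.patched.read_text(errors="replace").splitlines()
    return list(difflib.unified_diff(before, after, lineterm=""))
```

Calling `list(iter_pairs(Path("path/to/dataset")))` would then give the pairs a classifier or repair model could be trained on, and `patch_diff` shows what the fix changed in each one.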

The Impact of CrossVul

Since its release, CrossVul has enabled researchers and developers to:

It’s been rewarding to see the dataset help push forward software security research.

Access CrossVul

You can explore the dataset and read the full paper here:

CrossVul Paper and Dataset

Looking Ahead

Security challenges continue to grow, and datasets like CrossVul play a vital role in powering new research and better tools. I’m proud to contribute this resource to the community and excited to see its ongoing impact.