The Limits of the Identifiable: Challenges in Python Version Identification with Deep Learning

Marcus Gerhold, Lola Solovyeva, Vadim Zaytsev

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

The evolution of Python requires accurate version identification to facilitate compatibility and ongoing support. We extend previous work on deep learning models for Python version identification, in which LSTM and CodeBERT achieved 92% accuracy on short code snippets. We expand these results to larger, realistic files, using code segmentation techniques to vary the input granularity from per-line analysis to larger code segments. Our findings show that while LSTM with CodeBERT embeddings maintained high accuracy on short snippets, performance drops significantly on longer segments, particularly when balancing information retention against misclassification risk. Notably, import-statement analysis, despite being the most intuitive indicator of version requirements, reached only 30% accuracy, which exposes the limitations of our approach when it encounters rare or user-defined modules. Overall, the findings expose the limitations of deep learning for language version identification and suggest that alternative approaches may be necessary for high accuracy on larger datasets.
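
To make the notion of input granularity concrete, the following is a minimal sketch of how a source file might be cut into per-line or multi-line segments and how its import lines might be isolated. It is an illustration only, not the authors' pipeline: the function names `segment_lines` and `imported_modules`, the choice to skip blank lines, and the regular expression are all assumptions. A plain text scan is used rather than Python's `ast` module, because `ast` only parses code that is valid for the running interpreter and would reject files written for other Python versions.

```python
import re

# Hypothetical pattern: captures the first module named on an
# "import x" or "from x import y" line. Relative imports such as
# "from . import y" are deliberately not matched.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][\w.]*)")


def segment_lines(source: str, size: int = 1) -> list[str]:
    """Split source text into consecutive segments of `size` non-blank lines.

    size=1 gives per-line input; larger values give coarser segments.
    The final segment may be shorter than `size`.
    """
    lines = [line for line in source.splitlines() if line.strip()]
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def imported_modules(source: str) -> list[str]:
    """Return the first module named on each import line, in file order."""
    modules = []
    for line in source.splitlines():
        match = IMPORT_RE.match(line)
        if match:
            modules.append(match.group(1))
    return modules
```

Each segment, or the list of imported modules, would then be passed to whatever classifier is under study. Smaller segments isolate version-specific syntax but carry less context, while larger ones keep more context at the cost of mixing many signals, which is the trade-off the abstract describes.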

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024
Place of Publication: Piscataway, NJ
Publisher: IEEE
Pages: 137-146
Number of pages: 10
ISBN (Electronic): 979-8-3503-3066-3
ISBN (Print): 979-8-3503-3067-0
Publication status: Published - 16 Jul 2024
Event: 31st IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024 - Rovaniemi, Finland
Duration: 12 Mar 2024 to 15 Mar 2024

Publication series

Name: Proceedings IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Publisher: IEEE
Volume: 2024
ISSN (Print): 1534-5351
ISSN (Electronic): 2640-7574

Conference

Conference: 31st IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024
Country/Territory: Finland
City: Rovaniemi
Period: 12/03/24 to 15/03/24

Keywords

  • CodeBERT
  • Deep Learning (DL)
  • Python
  • Software language identification
