Abstract
The evolution of Python requires accurate version identification to facilitate compatibility and ongoing support. We extend previous work on deep learning models for Python version identification, in which LSTM and CodeBERT achieved 92% accuracy on short code snippets. We expand these results to larger, realistic files, using code segmentation techniques at varying input granularities, ranging from per-line analysis to larger code segments. Our findings show that while LSTM with CodeBERT embeddings maintained high accuracy on short snippets, performance drops significantly on longer segments, particularly when balancing information retention against misclassification risk. Notably, import-statement analysis, despite being the most intuitive indicator of version requirements, reached only 30% accuracy, which highlights the weakness of our approach when encountering rare or user-defined modules. Overall, the findings expose the limitations of deep learning for language version identification and suggest that alternative approaches may be necessary for high accuracy on larger datasets.
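
The input granularities mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`segment_by_line`, `segment_by_chunk`, `extract_imports`), the chunk size, and the regex-based import detection are all assumptions made for illustration. Text-based splitting is used rather than Python's `ast` module because a file written for a different Python version may not parse under the interpreter running the analysis.

```python
from __future__ import annotations

import re

# Hypothetical pattern: matches "import x" and "from x import y" at line start.
# Simplification: only the first module of "import a, b" is captured.
_IMPORT_RE = re.compile(r"^[ \t]*(?:import|from)[ \t]+([A-Za-z_][\w.]*)", re.MULTILINE)


def segment_by_line(source: str) -> list[str]:
    """Finest granularity: one segment per non-blank line."""
    return [line for line in source.splitlines() if line.strip()]


def segment_by_chunk(source: str, lines_per_chunk: int = 5) -> list[str]:
    """Coarser granularity: consecutive blocks of `lines_per_chunk` lines."""
    lines = source.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def extract_imports(source: str) -> list[str]:
    """Import-only view: top-level module names, in order of appearance."""
    return [match.split(".")[0] for match in _IMPORT_RE.findall(source)]


if __name__ == "__main__":
    sample = (
        "import os\n"
        "from collections import OrderedDict\n"
        "import urllib2\n"
        "\n"
        'print "hello"\n'
        "x = 1\n"
    )
    print(segment_by_line(sample))
    print(segment_by_chunk(sample, lines_per_chunk=2))
    print(extract_imports(sample))
```

Each segment would then be passed to a classifier separately. The import-only view also suggests why rare or user-defined modules are difficult: a module name that carries no version signal leaves the model with little to work from.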
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024 |
| Place of Publication | Piscataway, NJ |
| Publisher | IEEE |
| Pages | 137-146 |
| Number of pages | 10 |
| ISBN (Electronic) | 979-8-3503-3066-3 |
| ISBN (Print) | 979-8-3503-3067-0 |
| Publication status | Published - 16 Jul 2024 |
| Event | 31st IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024, Rovaniemi, Finland. Duration: 12 Mar 2024 → 12 Mar 2024. Conference number: 31 |
Publication series
| Name | Proceedings IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) |
|---|---|
| Publisher | IEEE |
| Volume | 2024 |
| ISSN (Print) | 1534-5351 |
| ISSN (Electronic) | 2640-7574 |
Conference
| Conference | 31st IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024 |
|---|---|
| Abbreviated title | SANER 2024 |
| Country/Territory | Finland |
| City | Rovaniemi |
| Period | 12/03/24 → 12/03/24 |
Keywords
- CodeBERT
- Deep Learning (DL)
- Python
- Software language identification