Saif:
Hello,
This week I focused on the following action items:
Action Item 1 – Monday, Jun 16
“From the targeted dataset, pick 25-30 samples and manually generate code using an open-source online platform (Google Gemma). Then check the generated code for vulnerabilities using a security tool.”
Action Item 2 – Monday, Jun 19
Conduct a survey of datasets that have been used to train classifiers for distinguishing human-written code from LLM-generated code.
I will continue working on constructing the dataset. (A minimal sketch of the vulnerability check from Action Item 1 appears below.)
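For reference, here is a minimal sketch of how the vulnerability check in Action Item 1 could be scripted. It assumes Flawfinder as the security tool and a folder named generated_samples of .c files; neither is specified in the update, so both are placeholders.

import pathlib
import subprocess

# Run Flawfinder (an open-source C/C++ static analyzer) over each generated
# sample and print its findings. Tool choice and folder name are assumptions.
for c_file in pathlib.Path("generated_samples").glob("*.c"):
    result = subprocess.run(["flawfinder", str(c_file)], capture_output=True, text=True)
    print(f"== {c_file.name} ==")
    print(result.stdout)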
Saif:
Hi group,
Action Item 1: Conduct an exhaustive search to identify any existing datasets that contain problem instructions as prompts, along with generated code and human-written code samples. (Due: June 9 - Done)
Action Item 2: We will construct our own dataset using problem instructions and corresponding code samples (Due: June 12 - in progress)
Results/Findings:
Question/Issue:
(I needed to fix the WLAN card issue on my laptop. Apologies for the delayed post.)
Saif:
Hi Team,
Action Items in Progress:
Last week, I focused on developing a chat-based application that generates code from prompts for the study. I experimented with both OpenAI and HuggingFace models.
Issues:
Our target dataset (FormAI-V2) does not include the original prompts. I’ve reached out to the authors requesting access, but haven’t received any response yet. I’ve been using CS 1310 assignment prompts.
I’ve run out of OpenAI credits, so I switched to HuggingFace models. However, since I am on HuggingFace’s free tier, getting model access and running inference is time-consuming.
Note: I am looking for alternative datasets that contain the original prompts along with the generated C code.
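For context, a minimal sketch of the kind of HuggingFace generation call involved; the model name (an open CodeGen checkpoint) and the prompt are illustrative placeholders rather than the exact setup used.

from transformers import pipeline

# Load an open code-generation model; "Salesforce/codegen-350M-multi" is only
# an illustrative choice, not necessarily the model used in the study.
generator = pipeline("text-generation", model="Salesforce/codegen-350M-multi")

prompt = "/* Write a C program that reads a line from stdin and prints it reversed. */"
out = generator(prompt, max_new_tokens=256, do_sample=False)
print(out[0]["generated_text"])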
No update was provided for the week ending 2025-06-04.
No update was provided for the week ending 2025-05-28.
Saif:
Action items completed:
Explain a few diverse examples from the dataset -
Follow-up action item: find a good dataset that includes the original prompts. I have sent an email to the authors of FormAI-V2 and am awaiting their response.
Action Items in Progress:
Divergence & Regeneration Experiment: May 14
• Identify where the LLM diverges from secure code.
• Replace early vulnerable token → regenerate.
• Measure if vulnerability is avoided.
As I do not have access to robust prompts, I could only rely on basic prompts for class-level assignments.
Observation: in general, even when the faulty token is fixed, the LLM still hallucinates in the next iteration while generating further tokens.
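A minimal sketch of the truncate-and-regenerate step described above; the model name, prompt, code snippet, and divergence index k are placeholders for illustration, not the actual experimental setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open code model used purely as a stand-in.
name = "Salesforce/codegen-350M-multi"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "/* Copy a user-supplied string into a fixed-size buffer. */\n"
vulnerable_code = "void copy(char *input) { char buf[16]; strcpy(buf, input); }"

ids = tok(prompt + vulnerable_code, return_tensors="pt").input_ids
k = 20                 # hypothetical index where the output diverges from the secure reference
prefix = ids[:, :k]    # keep everything before the faulty token
# (the secure token could be spliced onto the prefix here before regenerating)
with torch.no_grad():
    regen = model.generate(prefix, max_new_tokens=64, do_sample=False,
                           pad_token_id=tok.eos_token_id)
print(tok.decode(regen[0], skip_special_tokens=True))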
I came across a relevant paper published at AAAI 2025: “CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification”. A short summary of the paper follows:
CodeHalu focuses on “code hallucinations”—code generated by LLMs that looks correct but fails when executed. Instead of relying on static analysis, the authors run multiple outputs for each prompt and test them using real test cases. They then apply a two-step method to group the failures into four main types: mapping errors, naming mistakes, resource issues (like wrong memory size), and logic bugs (such as infinite loops or type mismatches). They build a dataset called CodeHaluEval with 8,883 labeled samples from 699 tasks and evaluate 17 LLMs. Results show larger models (like GPT-4, LLaMA-3) hallucinate less, but logic errors remain common. This approach provides a practical way to detect and understand hidden failures in generated code.
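As a toy illustration of the execution-based checking idea (not the actual CodeHaluEval harness), a candidate C sample can be compiled and run against input/output test cases; a compilation failure or a wrong output marks the sample as a failure.

import os
import subprocess
import tempfile

def passes_tests(c_source, tests):
    """Compile a candidate C sample and run it against (stdin, expected_stdout) pairs."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "cand.c"), os.path.join(d, "cand")
        with open(src, "w") as f:
            f.write(c_source)
        if subprocess.run(["gcc", src, "-o", exe]).returncode != 0:
            return False  # does not even compile
        for stdin_data, expected in tests:
            r = subprocess.run([exe], input=stdin_data, capture_output=True, text=True)
            if r.returncode != 0 or r.stdout.strip() != expected.strip():
                return False
        return True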
No update was provided for the week ending 2025-05-14.
Saif:
Milestone: Identify Research Topic (May 04)
This week I do not have reportable progress due to my final projects and TA work. I look forward to working on the action items from the previous week, listed below.
1) Explain a few diverse examples from the dataset: May 7
• Select one LLM-generated C code with a known vulnerability.
• Show: prompt, vulnerable line, CWE type, and secure (human/LLM) counterpart.
• Purpose: Understand where and why the LLM fails.
2) Divergence & Regeneration Experiment: May 14
• Identify where the LLM diverges from secure code.
• Replace early vulnerable token → regenerate.
• Measure if vulnerability is avoided.
3) Compare LLM-Generated vs Human-Written C. Perform an empirical study: May 28
• Compare vulnerability rates and CWE distributions (a toy sketch of this comparison follows the list).
• Analyze structural/code-style differences.
• Purpose: Quantify how vulnerabilities differ across LLM and human code.
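A toy sketch of the CWE-distribution comparison from item 3; the CWE labels below are placeholder values and would in practice come from a static analyzer or from the dataset's annotations.

from collections import Counter

# Placeholder per-sample CWE labels; in practice these would come from a
# security tool run over each code sample.
llm_cwes = ["CWE-787", "CWE-476", "CWE-787", "CWE-190"]
human_cwes = ["CWE-476", "CWE-401"]

llm_dist, human_dist = Counter(llm_cwes), Counter(human_cwes)
for cwe in sorted(set(llm_dist) | set(human_dist)):
    print(f"{cwe}: LLM={llm_dist[cwe]}  human={human_dist[cwe]}")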
No update was provided for the week ending 2025-04-30.
Saif:
Hi group,
Milestone: Identify a research topic - May 04
Last week, I generated a report on existing datasets, reported in the literature, that include vulnerable code generated by various LLMs.
This week I am focusing on the following action items:
Replies:
Dr. Lei:
try to send a written report about your findings before thu meeting. the report does not have to be well-written or very detailed. but it should contain the major points.
Saif:
Hi Group,
Milestone: Diagnostic Evaluation (awaiting response from Dr. Khalili)
Milestone: Complete a literature review on LLM explanation - May 4
This week I focused on the following action items:
- Empirical and Theoretical Distinguishing Characteristics: Stylistic/structural patterns: LLM code tends to be more formulaic and concise and uses narrower token/naming distributions; comments are often more consistent but sometimes mismatched with the code.
- Coding Style Markers: differences in variable/class naming, control-flow depth, and code/comment length and distribution; human code is more diverse (a crude metric sketch appears after this list).
2. Publicly available LLM-generated code datasets with vulnerabilities:
- Scope of Security Bugs and Limitations: Most datasets focus on well-known bug/vulnerability types (CWE Top-25, input validation, memory safety, etc.), with variable coverage of edge cases, multi-step chain issues, or “in-the-wild” project code.
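A crude sketch of how such style markers could be computed per file; the regex-based metrics below (identifier length, comment density, brace-nesting depth) are simplistic placeholders for the more careful measures used in the literature.

import re

def style_metrics(c_source):
    """Crude per-file style markers for a C source string."""
    identifiers = re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", c_source)
    comments = re.findall(r"//[^\n]*|/\*.*?\*/", c_source, flags=re.S)
    depth = max_depth = 0
    for ch in c_source:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)
    return {
        "avg_identifier_len": sum(map(len, identifiers)) / max(len(identifiers), 1),
        "comment_char_ratio": sum(map(len, comments)) / max(len(c_source), 1),
        "max_nesting_depth": max_depth,
    }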
Replies:
Dr. Lei:
good job. a few comments:
Saif:
Hi group,
Milestone: Diagnostic Evaluation (awaiting response from Dr. Khalili)
Milestone: Complete a literature review on LLM explanation - May 4
This week I focused on the following paper:
Additional note:
Replies:
Dr. Lei:
sounds good. please prepare a presentation on your findings. try to focus on the most important points. also try to come up with three ideas on the topic of security testing of LLM-generated code, and discuss them when we meet on thursday
Saif:
Ok Professor.
Saif:
Hello group,
PhD Milestone: Diagnostic Evaluation (April 7)
Milestone: Complete a literature review on LLM explanation - May 4
New Study: Vulnerabilities in LLM-generated code
Last week, I went through the following paper:
I am conducting a literature search on vulnerabilities in LLM-generated code. I have shared the incomplete list in my channel. I will provide the comprehensive list tomorrow (Wednesday, 4/2) and plan to discuss it on Thursday.
Replies:
Dr. Lei:
sounds good. please prepare some slides for thu meeting to discuss
Saif:
Good morning,
Milestone: Diagnostic Evaluation (April 7)
Milestone: Complete a literature review on LLM explanation - May 4
This week I focused on the challenges in defining locality for generating accurate explanations. I am trying to understand different versions of LIME. I have gone through the following papers:
This paper proposes an alternative sampling method to improve the local fidelity of surrogate models and evaluates it against LIME.
The paper investigates the challenges of generating sample points in an instance’s neighborhood, balancing interpretability with explanation accuracy, and determining the appropriate sample size. The findings highlight issues with LIME’s kernel-based weighting and boundary approximation.
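To make the knobs under discussion concrete, here is a minimal LIME usage sketch: kernel_width controls the kernel-based locality weighting, and num_samples controls the size of the sampled neighborhood. The dataset, model, and parameter values are illustrative, not taken from the papers.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    kernel_width=0.75,  # smaller width = more local weighting of perturbed samples
)
exp = explainer.explain_instance(data.data[0], clf.predict_proba, num_samples=5000)
print(exp.as_list())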
Saif:
Good morning,
Milestone: Diagnostic Evaluation
Current work: LLM explanation
Last week, I tried to run part of the experiment from the paper I presented last time, to get a better grasp of LIME.
For the diagnostic evaluation, I have sent out emails to the tentative committee members. Two of the professors have already replied. I am awaiting Dr. Ji’s response.
Replies:
Dr. Lei:
please try to make your status report more informative
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - 3rd Week of April)
Milestone: Complete a literature review on LLM explanation - May 4
This week I am planning to run the experiments from the empirical study I presented on Friday.
Replies:
Dr. Lei:
we need to discuss who you will invite to be on your committee.
Saif:
Yes, Dr. Lei. I have sent you a direct message on Slack describing my current academic progress and my tentative professor list. Thank you.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: LLM Explanation (Milestones yet to be decided)
Action items completed:
Replies:
Dr. Lei:
please note that the purpose of the New Ideas session is to introduce interesting ideas/perspectives to the group. try to select a topic/paper that truly excites you; otherwise, it would not serve the purpose.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: LLM Explanation
Action items completed:
Following the discussion at Friday’s meeting, I compiled a comprehensive list of papers covering perturbation methods for LLMs. I have added the list to my channel.
Action items for this week:
I have not yet organized the workflow and milestone dates, but this week I will focus on setting up and running experiments.
Replies:
Dr. Lei:
good job on the collection of papers. at this stage, i suggest you give priority to the big picture before you go into the details of each paper. try to find a good survey paper on this topic first, if one exists.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: Investigating Sparse AutoEncoders for LLM Explanation
I have been studying approaches to LLM explanation. I came across this paper, which discusses the existing techniques and their challenges: “Explainability for Large Language Models: A Survey” by H. Zhao et al., 2024.
From the existing approaches, I would like to explore the following two:
Feature Interpretability via Concept Activation Vectors (CAVs):
What it is: Use post-hoc linear classifiers or other mechanisms to define high-level concept vectors in the latent space that align with human-understandable constructs (e.g., sentiment, gender).
Applications:
Explain model decisions in human terms by projecting latent representations onto concept vectors.
Diagnose the presence of biases or specific abstract features (e.g., fairness in language models).
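A minimal sketch of the linear-probe idea behind CAVs, using random arrays as stand-ins for cached LLM activations; all names and values here are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for activations of examples that do / do not exhibit the concept
# (e.g., positive sentiment); shape = (num_examples, hidden_size).
acts_with_concept = np.random.randn(100, 768)
acts_without_concept = np.random.randn(100, 768)

X = np.vstack([acts_with_concept, acts_without_concept])
y = np.array([1] * 100 + [0] * 100)

probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit-norm concept vector

new_activation = np.random.randn(768)
print("concept score:", float(new_activation @ cav))  # projection onto the concept direction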
Sparse Feature Extraction:
What it is: Neuroscience-inspired methods like sparse autoencoders or dictionary learning extract sparse or disentangled features from LLM activation spaces.
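And a toy sparse-autoencoder sketch over LLM activations, with an L1 penalty encouraging sparse feature codes; the dimensions, penalty weight, and random input are placeholder choices.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Tiny sparse autoencoder over d_model-dimensional activations."""
    def __init__(self, d_model=768, d_hidden=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse feature code
        return self.dec(z), z

acts = torch.randn(32, 768)  # stand-in for cached LLM activations
sae = SparseAutoencoder()
recon, z = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity
loss.backward()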
Replies:
Dr. Lei:
as i suggested, focus on the big picture first before you dive into the details of a particular approach
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: Investigating Sparse AutoEncoders for LLM Explanation
Action items done in the previous week:
– Regarding the fundamentals of SAEs, I found the following:
Action items for next week:
– My primary plan for this week is to look for small open-source SAE models that can be trained at a smaller scale
– And continue grouping the papers
Replies:
Dr. Lei:
i suggest you think more about the problem, i.e., explaining LLM models. in addition to SAEs, what are other approaches to LLM explanation? what are the general technical problems, and what are the existing approaches to solving these problems?
Saif:
Ok Professor. I will look into this according to your suggestion.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: Investigating Sparse AutoEncoders for LLM Explanation
Action items completed:
2. Challenges in Implementation:
Team meeting update:
I was working on compiling the papers. There was not much to discuss this week so we skipped the meeting.
Action items for this week:
I will continue summarizing the papers from the compiled list. I would like to sign up for a technical discussion on Friday on my findings on this topic.
Replies:
Dr. Lei:
looks good. keep up with the good work
Saif:
Good morning group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Milestone: Literature Review Draft - January 23, 2025
Collaboration with Fadul:
Replies:
Dr. Lei:
sounds good. please work with Fadul and Sunny on possible new topics. the current focus for you is to find a topic to dive into in the next two weeks.
No update was provided for the week ending 2024-12-18.
Saif:
Hi Group,
Updated Milestone: Literature Review Draft - January 23, 2025
Based on the feedback from Friday’s meeting, I have updated the timeline for the literature review.
Action Items from last week:
After Friday, I refined the study design section, added research questions, and identified that performance metric summaries needed to be included.
I have an exam to proctor today, and after that the professor would like to discuss the grading with me. I might be late joining the meeting.
Replies:
Dr. Lei:
this is probably your most meaningful status update so far, which i am really happy to see. keep it up
Saif:
Thank you Dr. Lei.
Saif:
Action items completed:
We had a general discussion with Fadul regarding collaboration.
Saif:
Hi group,
I had my Compilers final exam on Friday and two project submissions due through yesterday, so I couldn’t spend much time on research.
I will be presenting the new idea today.
Saif:
Milestone: Complete Literature Survey - Dec 5
Saif:
Hi group,
Milestone: Complete Literature Survey - Dec 5
Saif:
Hi group,
Milestone: Complete Literature Survey - Dec 5
Action items–
Last week, I summarized these four papers for the literature survey:
This Friday, I will brief my current findings from the literature search.
Replies:
Dr. Lei:
if i remember correctly, i suggested you make a schedule towards the paper submission. can you share the schedule?
Saif:
Hi group,
Milestone: Complete Literature Survey - Dec 5
Action items
Last week, I summarized these three papers for the literature survey:
I will present my current findings to the group on Friday (11/8).
Saif:
Milestone: Complete a survey paper - Dec 5 (venue yet to be decided)
Action items -
Last week, I read these four papers from my literature search and jotted down the problems they address, their key insights and inspiration toward a solution, the steps they take toward the solution, and the inputs and outputs of their approach.
Replies:
Dr. Lei:
please break down this milestone and have some intermediate deliverables, so that you can gauge the intermediate progress.
Saif:
Milestone: Complete a survey paper - Dec 5 (venue yet to be decided)
– For research, I have updated the list for my literature search.
– I have documented papers that use LLM-based approaches for source code smell detection / code summarization tasks.
– I was mostly working on my course projects last week.
– I have been grading midterm papers and an assignment for my TA work.
Saif:
Hello everyone,
Broad Milestone: Complete Code smell detection project by Thanksgiving (Nov 26)
Complete grouping the papers - Sep 29 (partially completed)
Present the findings of the existing literature - Oct 11. For this one, I am preparing a presentation to show the findings of LLM-based approaches to code smell detection.
Action Items from last week:
Replies:
Dr. Lei:
For a milestone to be meaningful, it must have clear deliverables and an objective way to check whether it is completed. Speaking of a research project being completed, i would expect a research paper to be submitted.
Dr. Lei:
also for your project to make real progress, i want you to make a proposal to the group. in the proposal, please clearly specify what problem you are trying to address, in terms of input/output, what are the technical challenges, and what are your ideas to address the challenges, and how your idea would compare to existing work. also you must break down a project into smaller tasks and put a target date on each small task.
Saif:
Thank you for the feedback Professor.
I’ll revise the milestone to clearly define goals and to break it into smaller tasks.
Our plan was to aim for a survey paper. Should I target a conference with a December/January deadline?
Dr. Lei:
typically a survey paper is published in a journal. not many conferences accept survey papers. one target journal is ACM Computing Surveys. journal submissions can be made anytime, i.e., there are no specific deadlines. i suggest you target the end of this semester.
Saif:
Sounds good. I will adjust the milestone deadline accordingly.
No update was provided for the week ending 2024-10-02.
Saif:
Hello everyone,
Broad Milestone: Complete Code smell detection project by Thanksgiving (Nov 26)
Action Items from last week:
I am still working on the categorization of the papers for the literature review.
No update was provided for the week ending 2024-09-17.
No update was provided for the week ending 2024-09-13.