Saif:
Action Items:
This week I finally completed the one-page summary of related papers.
I read the following paper and would like to discuss today:
S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “An Empirical Study of the Non-Determinism of ChatGPT in Code Generation,” TOSEM 2025
Goal:
Quantify and characterize the non-determinism of ChatGPT in code generation across multiple datasets and model versions (GPT-3.5 and GPT-4), with systematic measurement of semantic, syntactic, and structural similarity.
Detection baselines:
Gap:
They did not analyze the actual content or correctness of the outputs when OER = 0. Specifically:
They didn’t inspect the generated code to see what kinds of errors caused the output differences (e.g., logic bug, wrong formula, off-by-one).
They didn’t verify whether any of the differing outputs are actually correct, even if the others are not (a toy sketch of such a correctness check follows this list).
They didn’t explore whether the output differences are harmless or harmful (e.g., stylistic differences vs logic bugs).
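To make this gap concrete, here is a minimal sketch (my own, not from the paper) of how one could check whether outputs that differ across repeated generations are nonetheless correct. It assumes we already have each candidate's outputs on a shared set of test cases plus the expected outputs; all names and values below are illustrative.

```python
# Toy sketch: given the test outputs of several candidates generated from the
# same prompt, report (a) whether they behave identically (an OER-style check)
# and (b) which of them, if any, actually match the expected outputs.

def analyze_candidates(candidate_outputs, expected_outputs):
    # candidate_outputs: one list of test-case outputs per generated candidate
    # expected_outputs: reference outputs for the same test cases
    distinct_behaviors = {tuple(outs) for outs in candidate_outputs}
    correct_flags = [outs == expected_outputs for outs in candidate_outputs]
    return {
        "all_equivalent": len(distinct_behaviors) == 1,
        "num_distinct_behaviors": len(distinct_behaviors),
        "num_correct": sum(correct_flags),
        "correct_flags": correct_flags,
    }

# Example: three candidates, two test cases; outputs differ but one behavior is correct.
print(analyze_candidates(
    candidate_outputs=[["3", "7"], ["3", "8"], ["3", "7"]],
    expected_outputs=["3", "7"],
))
```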
Question:
What role does prompt ambiguity or under-specification play in non-determinism?
Note: I’ve had a cold (sneezing and runny nose) over the last few days. The allergy meds are making me a bit drowsy, but I’m still planning to go ahead with the presentation. I’ll let you know if anything changes.
Saif:
Hi group,
Action Items:
This week I read the following papers:
[1] H. Suh et al., “An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?” arXiv, Nov. 2024, doi: 10.48550/arXiv.2311.00005.
[2] B. Demirok and M. Kutlu, “AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection,” arXiv, Dec. 2024, doi: 10.48550/arXiv.2312.00020.
Questions:
Saif:
Hi group,
Action Items
This week, I mainly focused on the literature for the RQs.
RQ 1 – How do LLM-generated solutions compare to human-written code in terms of functional correctness and vulnerability profiles?
Functional correctness:
Vulnerability and CWE profile comparisons
RQ 2 – Role of prompt engineering & decoding controls in reducing functional and security errors
Several prompt engineering techniques have been discussed in the prior literature to mitigate errors:
Results:
I fell behind on running the test pass/fail experiments, so there are no results to share.
Questions:
NA
Saif:
Hello group,
Milestone: Vulnerability analysis in generated code - Complete by Aug 7, 25
Action Item: Formulate Research Questions
I have formulated a few research questions that I would like to discuss in our next meeting.
Action Item 1 – in progress
Evaluate Code Generation Behavior Across Temperatures.
I am investigating the effect of temperature on code generation. My focus is to find an optimal temperature.
According to the literature, a single best temperature rarely exists:
- For single-shot generation aiming at correctness, a low T (0–0.4) is usually optimal
- For multi-sample pass@k evaluation, a moderate T (0.4–0.8) with a larger k improves coverage (a sketch of the standard unbiased pass@k estimator follows below)
I will attach a detailed report on this by tonight.
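For reference, the sketch below is the standard unbiased pass@k estimator (as popularized by the HumanEval evaluation), not anything specific to my setup: n is the number of samples per problem, c the number that pass all tests, and k the budget.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:  # every size-k subset must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> pass@1 = 0.3, pass@5 is roughly 0.92
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```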
Action Item 2 – partially complete
Analyze Vulnerability Patterns and pass/fail statistics
For this action item, I have shared the CWE results for 90 generated solutions at different temperatures. Statistics for the pass/fail comparison are still due.
Action Item 3 – pending
Submit solutions to the original platforms
I am yet to start this action item.
Results/Findings:
N/A
Questions / Issues:
N/A
(Apologies for the delayed response; I was setting up the environment on the new laptop until late at night and missed my alarm in the morning.)
Saif:
Hi group,
Milestone: yet to be decided, depending on the deadlines of the selected venue
Action Item 1 – Saturday, July 5 (Done)
Change the temperature setting to increase randomness in the output
Action Item 2 – Tuesday, July 8 (Done)
Execute test cases on the 276 generated solutions to verify correctness
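As a rough sketch of how that verification can be automated (the directory layout, file names, and test format below are assumptions, not my actual setup), each generated C++ file can be compiled with g++ and its stdout diffed against the expected output per test case:

```python
import pathlib
import subprocess
import tempfile

def run_solution(cpp_path: str, tests: list[tuple[str, str]], time_limit: float = 5.0) -> bool:
    """Compile one generated C++ solution and check it against (stdin, expected stdout) pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        binary = pathlib.Path(tmp) / "sol"
        build = subprocess.run(
            ["g++", "-O2", "-std=c++17", cpp_path, "-o", str(binary)],
            capture_output=True, text=True,
        )
        if build.returncode != 0:  # a compilation error counts as a failure
            return False
        for stdin_data, expected in tests:
            try:
                run = subprocess.run([str(binary)], input=stdin_data, capture_output=True,
                                     text=True, timeout=time_limit)
            except subprocess.TimeoutExpired:
                return False
            if run.stdout.strip() != expected.strip():
                return False
    return True

# Hypothetical usage: one file per generated solution, tests as (input, expected) pairs.
# print(run_solution("generated/problem_0001.cpp", [("1 2\n", "3\n")]))
```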
Plan for the next week:
Results/Findings:
Questions/Issues:
Saif:
Hi group,
Milestone: I have yet to decide on a concrete study and milestone date
Action Items completed:
Action Item 1 – Saturday, June 28 (done)
Action Item 2 – Thursday, July 3 (done for 10 problems with 3 generated solutions each)
Results/Findings:
Questions/issues:
Saif:
Hi group,
Action Item in progress:
We will construct our own dataset using problem instructions and corresponding code samples
This week I continued working on dataset creation using DeepMind’s code_contests dataset. I have successfully extracted 342 problems from the first chunk. From those, I identified 276 problems that include at least one correct C++ solution.
I configured and ran CodeLlama-7B.Q4_K_M.gguf locally via Ollama, and used it to generate C++ solutions for all 276 problems. The model was able to produce complete C++ code snippets based solely on the problem descriptions.
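For the record, here is a rough sketch of that pipeline, under a few assumptions: the Hugging Face copy of code_contests is loaded in streaming mode rather than by chunk, the C++ language code in the dataset's Language enum is taken to be 2 (worth verifying locally), and the model is addressed through Ollama's local REST API under the generic codellama:7b tag rather than the exact quantized file I used.

```python
import requests
from datasets import load_dataset

CPP = 2  # assumed code_contests language code for C++; verify against the dataset schema

# Stream the training split so the full dataset is not downloaded up front.
ds = load_dataset("deepmind/code_contests", split="train", streaming=True)

def generate_cpp(description: str, model: str = "codellama:7b") -> str:
    """Ask a locally running Ollama server for a C++ solution to one problem description."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": ("Write a complete C++ program that solves the following problem. "
                       "Read from stdin and write to stdout.\n\n" + description),
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

kept = 0
for problem in ds:
    langs = problem["solutions"]["language"]
    if not any(lang == CPP for lang in langs):  # keep problems with at least one C++ solution
        continue
    generated = generate_cpp(problem["description"])
    kept += 1
    if kept >= 5:  # small smoke test; remove the cap to cover all kept problems
        break
```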
Currently, I am beginning to compare CWE (Common Weakness Enumeration) differences between the human-written C++ solutions and the ones generated by CodeLlama. This will help us assess whether LLM-generated code introduces or avoids common coding vulnerabilities when compared to human-authored examples.
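A minimal sketch of that comparison, assuming a CWE-aware static analyzer such as flawfinder is used (the `--csv` output and its "CWEs" column are assumptions about that tool, not a description of my actual setup), is to tally CWE IDs per corpus and compare the two distributions:

```python
import csv
import io
import subprocess
from collections import Counter

def cwe_counts(directory: str) -> Counter:
    """Run flawfinder over a directory of C/C++ files and tally the CWE IDs it reports."""
    out = subprocess.run(["flawfinder", "--csv", directory], capture_output=True, text=True)
    counts: Counter = Counter()
    for row in csv.DictReader(io.StringIO(out.stdout)):
        for cwe in (row.get("CWEs") or "").split(","):  # e.g. "CWE-120, CWE-20"
            cwe = cwe.strip()
            if cwe:
                counts[cwe] += 1
    return counts

# Hypothetical layout: human-written and CodeLlama-generated solutions in sibling folders.
human = cwe_counts("solutions/human")
generated = cwe_counts("solutions/codellama")
for cwe in sorted(set(human) | set(generated)):
    print(f"{cwe}: human={human[cwe]}, generated={generated[cwe]}")
```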
Results/Findings:
Question/Issue:
No blockers at the moment. Continuing with CWE analysis.
Replies:
Dr. Lei:
Sounds good. Try to think deep and see if you can make any interesting observations about the results.
No update was provided for the week ending 2025-04-30.
Saif:
Hi group,
Milestone: Identify a research topic - May 04
Last week, I generated a report on existing datasets that include vulnerable code generated by various LLMs, as reported in the existing literature.
This week I am focusing on the following action items:
Replies:
Dr. Lei:
try to send a written report about your findings before thu meeting. the report does not have to be well-written or very detailed. but it should contain the major points.
Saif:
Hi Group,
Milestone: Diagnostic Evaluation (awaiting response from Dr. Khalili)
Milestone: Complete a literature review on LLM explanation - May 4
This week I focused on the following action items:
- Empirical and Theoretical Distinguishing Characteristics: Stylistic/Structural Patterns: LLM code tends to be more formulaic and concise and uses narrower token/naming distributions; comments are often more consistent but sometimes mismatched with the code
- Coding Style Markers: Differences in variable/class naming, control-flow depth, and code/comment length and distribution; human code is more diverse (a toy sketch of such metrics follows after this list)
2. Publicly available LLM-generated code datasets with vulnerabilities:
- Scope of Security Bugs and Limitations: Most datasets focus on well-known bug/vulnerability types (CWE Top-25, input validation, memory safety, etc.), with variable coverage of edge cases, multi-step chain issues, or “in-the-wild” project code
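To make the style markers above concrete, here is a toy sketch (my own, not taken from either paper) of a few crude surface metrics one could compute per file and then compare between human-written and LLM-generated code: identifier diversity, comment density, line length, and indentation depth.

```python
import re

def style_metrics(source: str) -> dict:
    """Crude surface-level style markers for a C-like source file (illustrative only)."""
    lines = [l for l in source.splitlines() if l.strip()]
    identifiers = re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", source)  # keywords included; it's crude
    comment_lines = [l for l in lines if l.strip().startswith(("//", "/*", "*"))]
    return {
        "identifier_diversity": len(set(identifiers)) / max(len(identifiers), 1),
        "comment_line_ratio": len(comment_lines) / max(len(lines), 1),
        "avg_line_length": sum(len(l) for l in lines) / max(len(lines), 1),
        "max_indent_depth": max(((len(l) - len(l.lstrip())) // 4 for l in lines), default=0),
    }

# Hypothetical usage: compare paired human and LLM solutions to the same problem.
# print(style_metrics(open("human/p1.cpp").read()))
# print(style_metrics(open("llm/p1.cpp").read()))
```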
Replies:
Dr. Lei:
good job. a few comments:
Saif:
hi group,
Milestone: Diagnostic Evaluation (awaiting response from Dr. Khalili)
Milestone: Complete a literature review on LLM explanation - May 4
This week I focused on the following paper:
Additional note:
Replies:
Dr. Lei:
sounds good. please prepare a presentation on your findings. try to focus on the most important points. also try to think about three ideas that you could think of on the topic of security testing of LLM-generated code, and discuss them when we meet on thursday
Saif:
Ok Professor.
Saif:
Hello group,
PhD Milestone: Diagnostic Evaluation (April 7)
Milestone: Complete a literature review on LLM explanation - May 4
New Study: Vulnerabilities in LLM generated code
Last week I went through the following paper:
I am conducting a literature search on vulnerabilities in LLM-generated code. I have shared the incomplete list in my channel. I will provide the comprehensive list tomorrow (Wednesday, 4/2) and plan to discuss it on Thursday.
Replies:
Dr. Lei:
sounds good. please prepare some slides for thu meeting to discuss
Saif:
Good morning,
Milestone: Diagnostic Evaluation (April 7)
Milestone: Complete a literature review on LLM explanation - May 4
This week I focused on the challenges in defining locality for generating accurate explanations. I am trying to understand different versions of LIME. I have gone through the following papers:
This paper proposes an alternative sampling method to improve the local fidelity of surrogate models and evaluates it against LIME.
The paper investigates the challenges of generating sample points in an instance’s neighborhood, balancing interpretability with explanation accuracy, and determining the appropriate sample size. The findings emphasize issues with LIME’s kernel-based weighting and boundary approximation.
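To connect this to something runnable, here is a small sketch (my own, using the lime package's tabular explainer on a toy sklearn model, not the papers' setups) of the kind of experiment these issues motivate: explain the same instance with different neighborhood sample sizes and see how stable the top-weighted features are.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

instance = data.data[0]
for num_samples in (100, 1000, 5000):  # size of the perturbed neighborhood LIME samples
    exp = explainer.explain_instance(instance, clf.predict_proba,
                                     num_features=5, num_samples=num_samples)
    top_features = [feature for feature, _ in exp.as_list()]
    print(num_samples, top_features)  # unstable rankings hint at locality/sample-size issues
```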
Saif:
Good morning,
Milestone: Diagnostic Evaluation
Current work: LLM explanation
Last week I tried to run part of the experiment from the paper I presented last time, to get a better grasp of LIME.
For the diagnostic evaluation, I have sent out emails to the tentative committee. Two of the professors have already replied. I am awaiting Dr. Ji’s response.
Replies:
Dr. Lei:
please try to make your status report more informative
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - 3rd Week of April)
Milestone: Complete a literature review on LLM explanation - May 4
This week I am planning to run the experiments from the empirical study I presented on Friday.
Replies:
Dr. Lei:
we need to discuss who you will invite to be on your committee.
Saif:
Yes, Dr. Lei. I have sent you a direct message on Slack describing my current academic progress and a tentative list of professors. Thank you.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: LLM Explanation (Milestones yet to be decided)
Action items completed:
Replies:
Dr. Lei:
please note that the purpose of the New Ideas session is to introduce interesting ideas/perspectives to the group. try to select a topic/paper that truly excites you; otherwise, it would not serve the purpose.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: LLM Explanation
Action items completed:
After the discussion in Friday’s meeting, I compiled a comprehensive list of papers that covers perturbation for LLMs. I have added the comprehensive list to my channel.
Action items for this week:
I have not yet organized the workflow and milestone dates. But this week, I will focus on setting up and running experiments.
Replies:
Dr. Lei:
good job on the collection of papers. at this stage, i suggest you give priority to the big picture before you go into the details of each paper. try to find a good survey paper on this topic first, if one exists.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: Investigating Sparse AutoEncoders for LLM Explanation
I have been studying approaches to LLM explanation. I came across this paper, which discusses the existing techniques and the challenges of these approaches: “Explainability for Large Language Models: A Survey” - by H Zhao · 2024
From the existing approaches, I would like to explore the following two:
Feature Interpretability via Concept Vectors (CAVs):
What it is: Use post-hoc linear classifiers or other mechanisms to define high-level concept vectors in the latent space that align with human-understandable constructs (e.g., sentiment, gender); a toy probe sketch follows after the applications list below.
Applications:
Explain model decisions in human terms by projecting latent representations onto concept vectors.
Diagnose the presence of biases or specific abstract features (e.g., fairness in language models).
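As a toy illustration of the linear-probe flavor of this idea (the activations and labels below are synthetic placeholders, not real LLM activations), one can fit a logistic regression on hidden states collected with and without the concept and treat its weight vector as the concept direction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 256  # placeholder hidden-state dimensionality

# Synthetic stand-ins for activations collected on concept vs. non-concept inputs.
acts_with_concept = rng.normal(0.5, 1.0, size=(200, d_model))
acts_without_concept = rng.normal(0.0, 1.0, size=(200, d_model))

X = np.vstack([acts_with_concept, acts_without_concept])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
concept_vector = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Projecting a new activation onto the concept direction gives a scalar "concept score".
new_activation = rng.normal(0.5, 1.0, size=d_model)
print(float(new_activation @ concept_vector))
```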
Sparse Feature Extraction:
What it is: Neuroscience-inspired methods like sparse autoencoders or dictionary learning extract sparse or disentangled features from LLM activation spaces.
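And here is a minimal PyTorch sketch of the sparse-autoencoder idea: an overcomplete encoder/decoder trained with an L1 penalty on the hidden code. The activation tensor is a random placeholder standing in for real LLM activations, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # sparse, non-negative feature code
        return self.decoder(z), z

acts = torch.randn(4096, 512)                          # placeholder for collected LLM activations
sae = SparseAutoencoder(d_model=512, d_hidden=4096)    # overcomplete dictionary of features
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(100):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, z = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * z.abs().mean()  # reconstruction + sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```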
Replies:
Dr. Lei:
as i suggested, focus on the big picture first before you dive into the details of a particular approach
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: Investigating Sparse AutoEncoders for LLM Explanation
Action items done in the previous week:
– Regarding the fundamentals of SAEs, I found the following:
Action items for next week:
– My primary plan for this week is to look for small open-source SAE models that can be trained at a smaller scale
– And continue grouping the papers
Replies:
Dr. Lei:
i suggest you think more about the problem, i.e., explaining LLM models. in addition to SAEs, what are other approaches to LLM explanations? what are the general technical problems, and what are the existing approaches to solving these problems?
Saif:
Ok Professor. I will look into this according to your suggestion.
Saif:
Hello group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Current work: Investigating Sparse AutoEncoders for LLM Explanation
Action items completed:
2. Challenges in Implementation:
Team meeting update:
I was working on compiling the papers. There was not much to discuss this week so we skipped the meeting.
Action items for this week:
I will continue summarizing the papers from the compiled list. I would like to sign up for a technical discussion on Friday about my findings on this topic.
Replies:
Dr. Lei:
looks good. keep up with the good work
Saif:
Good morning group,
Milestone: Diagnostic Evaluation (Tentative - First Week of April)
Milestone: Literature Review Draft - January 23, 2025
Collaboration with Fadul:
Replies:
Dr. Lei:
sounds good. please work with Fadul and Sunny on possible new topics. the current focus for you is to find a topic to dive into in the next two weeks.
No update was provided for the week ending 2024-12-18.
Saif:
Hi Group,
Updated Milestone:Literature Review Draft - January 23, 2025
Based on the feedback on Friday’s meeting, I have updated the timeline for the literature review.
Action Items from last week:
After Friday, I refined the study design section, added research questions, and identified that performance metric summaries needed to be included.
I have an exam to proctor today and after that the Professor would like to discuss the grading with me. I might be late to join the meeting.
Replies:
Dr. Lei:
this is probably your most meaningful status update, which i am really happy to see. keep it up
Saif:
Thank you Dr. Lei.
Saif:
Action items Completed:
We had a general discussion with Fadul regarding collaboration.
Saif:
Hi group,
I had my Compilers final exam on Friday and two project submissions due through yesterday. I couldn’t spend much time on research.
I will be presenting the new idea today.
Saif:
Milestone: Complete Literature Survey - Dec 5
Saif:
Hi group,
Milestone: Complete Literature Survey - Dec 5
Saif:
Hi group,
Milestone: Complete Literature Survey - Dec 5
Action items–
Last week, I summarized these four papers for the literature survey
This Friday, I will brief my current findings from the literature search.
Replies:
Dr. Lei:
if i remember correctly, i suggested you make a schedule towards the paper submission. can you share the schedule?
Saif:
Hi group,
Milestone: Complete Literature Survey - Dec 5
Action items
Last week, I summarized these three papers for the literature survey
I will present my current findings to the group on the following Friday (11/8).
Saif:
Milestone: Complete a Survey paper - Dec 5 (venue yet to decide)
Action items -
Last week I read these four papers from my literature search and noted the problems they address, their key insights and inspiration for the solution, the steps they take toward the solution, and the inputs and outputs produced by their approach
Replies:
Dr. Lei:
please break down this milestone and have some intermediate deliverables, so that you can gauge the intermediate progress.
Saif:
Milestone: Complete a Survey paper - Dec 5 (venue yet to decide)
– For research, I have updated the list for my literature search.
– I have documented papers that use LLM-based approaches for code smell detection / code summarization tasks
– I was mostly working on my course projects last week
– I have been grading Midterm papers and an assignment for TA work
Saif:
Hello everyone,
Broad Milestone: Complete Code smell detection project by Thanksgiving (Nov 26)
Complete grouping the papers - Sep 29 (partially completed)
Present the findings of the existing literature - Oct 11. For this one, I am preparing a presentation to show the findings of LLM approaches to code smell detection.
Action Items from last week:
Replies:
Dr. Lei:
For a milestone to be meaningful, it must have clear deliverables and an objective way to check whether it is completed. Speaking of a research project being completed, i would expect a research paper to be submitted.
Dr. Lei:
also for your project to make real progress, i want you to make a proposal to the group. in the proposal, please clearly specify what problem you are trying to address, in terms of input/output, what are the technical challenges, and what are your ideas to address the challenges, and how your idea would compare to existing work. also you must break down a project into smaller tasks and put a target date on each small task.
Saif:
Thank you for the feedback Professor.
I’ll revise the milestone to clearly define goals and to break it into smaller tasks.
Our plan was to aim for a survey paper. Should I target a conference with a December/January deadline?
Dr. Lei:
typically a survey paper is published in a journal. not many conferences accept survey papers. one target journal is ACM Computing Surveys. journal submissions can be made anytime, i.e., there are no specific deadlines. i suggest you target the end of this semester.
Saif:
Sounds good. I will adjust the milestone deadline accordingly.
No update was provided for the week ending 2024-10-02.
Saif:
Hello everyone,
Broad Milestone: Complete Code smell detection project by Thanksgiving (Nov 26)
Action Items from last week:
I am still working on the categorization of the papers for the literature review.
No update was provided for the week ending 2024-09-17.
No update was provided for the week ending 2024-09-13.