chore: add integration tests for threading by Henrrypg · Pull Request #235 · openedx/openedx-ai-extensions

Henrrypg · 2026-06-18T19:48:59Z

This PR add some missing integration tests for threading

openedx-webhooks · 2026-06-18T19:49:04Z

Thanks for the pull request, @Henrrypg!

This repository is currently maintained by @felipemontoya.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

Dependencies

This PR must be merged before / after / at the same time as ...
Blockers

This PR is waiting for OEP-1234 to be accepted.
Timeline information

This PR must be merged by XX date because ...
Partner information

This is for a course on edx.org.
Supporting documentation
Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

codecov · 2026-06-18T19:51:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.32%. Comparing base (98f555e) to head (384097d).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #235   +/-   ##
=======================================
  Coverage   95.32%   95.32%           
=======================================
  Files          69       69           
  Lines        8086     8086           
  Branches      432      432           
=======================================
  Hits         7708     7708           
  Misses        283      283           
  Partials       95       95

Flag	Coverage Δ
unittests	`95.32% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

felipemontoya

The tests look good to me, specially the third, but I think we should refactor them into using the provider_capabilities dict.

felipemontoya · 2026-06-19T21:08:47Z

+
+@pytest.mark.live_llm
+@pytest.mark.django_db
+@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set")


When we set this specifically to OPENAI_API_KEY we are tying our understading of "which features are supported by which provider" to test case. We already have a way of mapping this in https://github.com/openedx/openedx-ai-extensions/blob/main/backend/openedx_ai_extensions/processors/llm/providers/__init__.py.

Can we make this skipif depend on the _PROVIDER_CAPABILITIES?

felipemontoya · 2026-06-19T21:09:52Z

+    """
+    When session.remote_response_id points to a non-existent / expired
+    OpenAI thread, the processor must catch previous_response_not_found,
+    clear the stale ID, start a fresh thread, and return a valid response.


Should we start a completely fresh thread or prehaps re-load the conversation we already have in submissions?

felipemontoya · 2026-06-19T21:10:06Z

+
+@pytest.mark.live_llm
+@pytest.mark.django_db
+@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set")


same, the capability we are looking for is server_side_thread_id

felipemontoya · 2026-06-19T21:13:26Z

+@pytest.mark.live_llm
+@pytest.mark.django_db
+@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set")
+def test_three_turn_context_chain(live_user, course_key):


This is something we want to tests for the way we call all providers. not just providers with server_side_id.

In the anthropic case, this test should still pass as our processor will construct the message from the local submissions and pass it. I think we should remove the limit of openai only and test that we pass it for all or rework a bit the test until we do.

On the other hand, kudos for the existence of this test. I will become one of the critical bits of safety net we could have for the threaded orchestrator (learner chatbot).

felipemontoya · 2026-06-19T21:14:06Z

+    a neutral turn 2 that does not reference it.  Verifies that the server-
+    side thread correctly chains three consecutive turns.
+    """
+    from openedx_ai_extensions.processors.llm.llm_processor import LLMProcessor  # pylint: disable=C0415


if we are importing this for every test, why not put it at the top?

felipemontoya · 2026-06-19T21:16:21Z

+    )
+    session.refresh_from_db()
+
+    # Turn 1 — plant memorable fact (sent via previous_response_id)


this does not need to be sent via previous_response_id. As I'm pointing out, we also want to test this for every other provider.

Importantly, we are working towards having gemini and a openweights model in hg in this test suite pretty soon.

felipemontoya · 2026-06-19T21:18:47Z

+
+
+@pytest.mark.live_llm
+@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")


this is something we want to test for any llm that supports the multi_turn_cache feature in _PROVIDER_CAPABILITIES

felipemontoya · 2026-06-19T21:22:03Z

+
+
+@pytest.mark.live_llm
+@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")


This might be the only test that is anthropic specific but I would not mind running this for other providers. A short message should not crash them with or without cache.

felipemontoya · 2026-06-19T21:31:54Z

+
+    session = create_live_session(
+        live_user, course_key,
+        remote_response_id="resp_fake_expired_thread_id_xyz_000000",


does OpenAI return previous_response_not_found for this, or does it return a 400 (malformed ID) or some other error? A made-up format string might trigger a different error code entirely. If the recovery code only catches that specific code, the test wouldn't actually exercise it. We can set it to a thread we know has expired

felipemontoya · 2026-06-19T21:37:38Z

+
+    session = create_live_session(live_user, course_key)
+
+    # Turn 0 — initialise the remote thread (system messages only; no user input


the test is relying on a side effect from a call that behaves incorrectly by design, and the phrase "current logic" means this silently changes if the logic changes.

This is a smell that is probably highlighting an conditional management from the underlying logic. Maybe we should fix the user_input reaching openai in the providers module where we do conditionals.

felipemontoya · 2026-06-19T21:40:56Z

+    }
+
+    # First call — warms the cache
+    proc1 = LLMProcessor(config=config, user_session=MagicMock(remote_response_id=None))


The file's own docstring says DB rows are used "so that session.save() exercises the actual persistence layer." But the Anthropic cache tests use MagicMock(remote_response_id=None), where save() silently disappears. If Anthropic's code path also calls session.save() (e.g., to update some state), the mock swallows it undetected. The inconsistency is unexplained.

Is there a key reason for this difference? can we have both providers (and the new providers that may come) exercise the exact same code path?

felipemontoya · 2026-06-19T21:42:26Z

+    r2 = proc2.process(context=_LONG_SYSTEM_CONTEXT, input_data="Summarize this in one sentence.")
+    assert r2.get("status") == "success", f"Second call failed: {r2}"
+
+    usage = proc2.get_usage()


do we really need to run get_usage here? it seems like something that would either requiere its own testing file for different providers or something that we ignore given that is litellm providing it and not something that is operation critical for us.

felipemontoya · 2026-06-19T21:46:03Z

+    assert result2.get("status") == "success", f"Turn 2 failed: {result2}"
+    response_text = (result2.get("response") or "").lower()
+    assert len(response_text) > 5, "Turn 2 produced an empty response"
+    assert "python" in response_text, (


"python" appears in DUMMY_CONTENT. Any call that includes DUMMY_CONTENT as context (including a completely fresh session with no recovery at all) would satisfy assert "python" in response_text.

We could ask something like, who is the creator of the programing language we are discussing and test for "guido".

I tested this with claude and here is the response:

The assertion would pass even if the recovery code was completely broken and turn 2 had zero awareness of turn 1.

What would actually solve it is planting something in turn 1 that can't be inferred from the context or general knowledge — something the model can only recall if it has access to what was said in turn 1:

▎ Turn 1: "My student ID is ZEPHYR-9142. Just say 'Got it.'"
▎ Turn 2: "What is my student ID?"
▎ Assert "9142" in response

This would imply that we are not starting fresh but actually loading turns back into the thread from our db

chore: add integration tests for threading

384097d

openedx-webhooks added open-source-contribution PR author is not from Axim or 2U core contributor PR author is a Core Contributor (who may or may not have write access to this repo). labels Jun 18, 2026

openedx-webhooks added this to Contributions Jun 18, 2026

github-project-automation Bot moved this to Needs Triage in Contributions Jun 18, 2026

felipemontoya requested changes Jun 19, 2026

View reviewed changes

felipemontoya reviewed Jun 19, 2026

View reviewed changes



		@pytest.mark.live_llm
		@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")


		session = create_live_session(live_user, course_key)

		# Turn 0 — initialise the remote thread (system messages only; no user input

Conversation

Henrrypg commented Jun 18, 2026

Uh oh!

openedx-webhooks commented Jun 18, 2026

Uh oh!

codecov Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

felipemontoya left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

felipemontoya Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Jun 18, 2026 •

edited

Loading

felipemontoya Jun 19, 2026 •

edited

Loading