Skip to content

chore: add integration tests for threading#235

Open
Henrrypg wants to merge 1 commit into
openedx:mainfrom
eduNEXT:hpg/threading-tests
Open

chore: add integration tests for threading#235
Henrrypg wants to merge 1 commit into
openedx:mainfrom
eduNEXT:hpg/threading-tests

Conversation

@Henrrypg

Copy link
Copy Markdown
Contributor

This PR add some missing integration tests for threading

@openedx-webhooks

Copy link
Copy Markdown

Thanks for the pull request, @Henrrypg!

This repository is currently maintained by @felipemontoya.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details
Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@openedx-webhooks openedx-webhooks added open-source-contribution PR author is not from Axim or 2U core contributor PR author is a Core Contributor (who may or may not have write access to this repo). labels Jun 18, 2026
@github-project-automation github-project-automation Bot moved this to Needs Triage in Contributions Jun 18, 2026
@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.32%. Comparing base (98f555e) to head (384097d).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #235   +/-   ##
=======================================
  Coverage   95.32%   95.32%           
=======================================
  Files          69       69           
  Lines        8086     8086           
  Branches      432      432           
=======================================
  Hits         7708     7708           
  Misses        283      283           
  Partials       95       95           
Flag Coverage Δ
unittests 95.32% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@felipemontoya felipemontoya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests look good to me, specially the third, but I think we should refactor them into using the provider_capabilities dict.


@pytest.mark.live_llm
@pytest.mark.django_db
@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we set this specifically to OPENAI_API_KEY we are tying our understading of "which features are supported by which provider" to test case. We already have a way of mapping this in https://github.com/openedx/openedx-ai-extensions/blob/main/backend/openedx_ai_extensions/processors/llm/providers/__init__.py.

Can we make this skipif depend on the _PROVIDER_CAPABILITIES?

"""
When session.remote_response_id points to a non-existent / expired
OpenAI thread, the processor must catch previous_response_not_found,
clear the stale ID, start a fresh thread, and return a valid response.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we start a completely fresh thread or prehaps re-load the conversation we already have in submissions?


@pytest.mark.live_llm
@pytest.mark.django_db
@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, the capability we are looking for is server_side_thread_id

@pytest.mark.live_llm
@pytest.mark.django_db
@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set")
def test_three_turn_context_chain(live_user, course_key):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we want to tests for the way we call all providers. not just providers with server_side_id.

In the anthropic case, this test should still pass as our processor will construct the message from the local submissions and pass it. I think we should remove the limit of openai only and test that we pass it for all or rework a bit the test until we do.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, kudos for the existence of this test. I will become one of the critical bits of safety net we could have for the threaded orchestrator (learner chatbot).

a neutral turn 2 that does not reference it. Verifies that the server-
side thread correctly chains three consecutive turns.
"""
from openedx_ai_extensions.processors.llm.llm_processor import LLMProcessor # pylint: disable=C0415

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are importing this for every test, why not put it at the top?

)
session.refresh_from_db()

# Turn 1 — plant memorable fact (sent via previous_response_id)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not need to be sent via previous_response_id. As I'm pointing out, we also want to test this for every other provider.

Importantly, we are working towards having gemini and a openweights model in hg in this test suite pretty soon.



@pytest.mark.live_llm
@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is something we want to test for any llm that supports the multi_turn_cache feature in _PROVIDER_CAPABILITIES



@pytest.mark.live_llm
@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the only test that is anthropic specific but I would not mind running this for other providers. A short message should not crash them with or without cache.


session = create_live_session(
live_user, course_key,
remote_response_id="resp_fake_expired_thread_id_xyz_000000",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does OpenAI return previous_response_not_found for this, or does it return a 400 (malformed ID) or some other error? A made-up format string might trigger a different error code entirely. If the recovery code only catches that specific code, the test wouldn't actually exercise it. We can set it to a thread we know has expired


session = create_live_session(live_user, course_key)

# Turn 0 — initialise the remote thread (system messages only; no user input

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is relying on a side effect from a call that behaves incorrectly by design, and the phrase "current logic" means this silently changes if the logic changes.

This is a smell that is probably highlighting an conditional management from the underlying logic. Maybe we should fix the user_input reaching openai in the providers module where we do conditionals.

}

# First call — warms the cache
proc1 = LLMProcessor(config=config, user_session=MagicMock(remote_response_id=None))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file's own docstring says DB rows are used "so that session.save() exercises the actual persistence layer." But the Anthropic cache tests use MagicMock(remote_response_id=None), where save() silently disappears. If Anthropic's code path also calls session.save() (e.g., to update some state), the mock swallows it undetected. The inconsistency is unexplained.

Is there a key reason for this difference? can we have both providers (and the new providers that may come) exercise the exact same code path?

r2 = proc2.process(context=_LONG_SYSTEM_CONTEXT, input_data="Summarize this in one sentence.")
assert r2.get("status") == "success", f"Second call failed: {r2}"

usage = proc2.get_usage()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to run get_usage here? it seems like something that would either requiere its own testing file for different providers or something that we ignore given that is litellm providing it and not something that is operation critical for us.

assert result2.get("status") == "success", f"Turn 2 failed: {result2}"
response_text = (result2.get("response") or "").lower()
assert len(response_text) > 5, "Turn 2 produced an empty response"
assert "python" in response_text, (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"python" appears in DUMMY_CONTENT. Any call that includes DUMMY_CONTENT as context (including a completely fresh session with no recovery at all) would satisfy assert "python" in response_text.

We could ask something like, who is the creator of the programing language we are discussing and test for "guido".

@felipemontoya felipemontoya Jun 19, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this with claude and here is the response:

The assertion would pass even if the recovery code was completely broken and turn 2 had zero awareness of turn 1.

What would actually solve it is planting something in turn 1 that can't be inferred from the context or general knowledge — something the model can only recall if it has access to what was said in turn 1:

▎ Turn 1: "My student ID is ZEPHYR-9142. Just say 'Got it.'"
▎ Turn 2: "What is my student ID?"
▎ Assert "9142" in response

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would imply that we are not starting fresh but actually loading turns back into the thread from our db

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core contributor PR author is a Core Contributor (who may or may not have write access to this repo). open-source-contribution PR author is not from Axim or 2U

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

3 participants