chore: add integration tests for threading#235
Conversation
|
Thanks for the pull request, @Henrrypg! This repository is currently maintained by Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review. 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. DetailsWhere can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources: When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #235 +/- ##
=======================================
Coverage 95.32% 95.32%
=======================================
Files 69 69
Lines 8086 8086
Branches 432 432
=======================================
Hits 7708 7708
Misses 283 283
Partials 95 95
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
felipemontoya
left a comment
There was a problem hiding this comment.
The tests look good to me, specially the third, but I think we should refactor them into using the provider_capabilities dict.
|
|
||
| @pytest.mark.live_llm | ||
| @pytest.mark.django_db | ||
| @pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set") |
There was a problem hiding this comment.
When we set this specifically to OPENAI_API_KEY we are tying our understading of "which features are supported by which provider" to test case. We already have a way of mapping this in https://github.com/openedx/openedx-ai-extensions/blob/main/backend/openedx_ai_extensions/processors/llm/providers/__init__.py.
Can we make this skipif depend on the _PROVIDER_CAPABILITIES?
| """ | ||
| When session.remote_response_id points to a non-existent / expired | ||
| OpenAI thread, the processor must catch previous_response_not_found, | ||
| clear the stale ID, start a fresh thread, and return a valid response. |
There was a problem hiding this comment.
Should we start a completely fresh thread or prehaps re-load the conversation we already have in submissions?
|
|
||
| @pytest.mark.live_llm | ||
| @pytest.mark.django_db | ||
| @pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set") |
There was a problem hiding this comment.
same, the capability we are looking for is server_side_thread_id
| @pytest.mark.live_llm | ||
| @pytest.mark.django_db | ||
| @pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OPENAI_API_KEY not set") | ||
| def test_three_turn_context_chain(live_user, course_key): |
There was a problem hiding this comment.
This is something we want to tests for the way we call all providers. not just providers with server_side_id.
In the anthropic case, this test should still pass as our processor will construct the message from the local submissions and pass it. I think we should remove the limit of openai only and test that we pass it for all or rework a bit the test until we do.
There was a problem hiding this comment.
On the other hand, kudos for the existence of this test. I will become one of the critical bits of safety net we could have for the threaded orchestrator (learner chatbot).
| a neutral turn 2 that does not reference it. Verifies that the server- | ||
| side thread correctly chains three consecutive turns. | ||
| """ | ||
| from openedx_ai_extensions.processors.llm.llm_processor import LLMProcessor # pylint: disable=C0415 |
There was a problem hiding this comment.
if we are importing this for every test, why not put it at the top?
| ) | ||
| session.refresh_from_db() | ||
|
|
||
| # Turn 1 — plant memorable fact (sent via previous_response_id) |
There was a problem hiding this comment.
this does not need to be sent via previous_response_id. As I'm pointing out, we also want to test this for every other provider.
Importantly, we are working towards having gemini and a openweights model in hg in this test suite pretty soon.
|
|
||
|
|
||
| @pytest.mark.live_llm | ||
| @pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set") |
There was a problem hiding this comment.
this is something we want to test for any llm that supports the multi_turn_cache feature in _PROVIDER_CAPABILITIES
|
|
||
|
|
||
| @pytest.mark.live_llm | ||
| @pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set") |
There was a problem hiding this comment.
This might be the only test that is anthropic specific but I would not mind running this for other providers. A short message should not crash them with or without cache.
|
|
||
| session = create_live_session( | ||
| live_user, course_key, | ||
| remote_response_id="resp_fake_expired_thread_id_xyz_000000", |
There was a problem hiding this comment.
does OpenAI return previous_response_not_found for this, or does it return a 400 (malformed ID) or some other error? A made-up format string might trigger a different error code entirely. If the recovery code only catches that specific code, the test wouldn't actually exercise it. We can set it to a thread we know has expired
|
|
||
| session = create_live_session(live_user, course_key) | ||
|
|
||
| # Turn 0 — initialise the remote thread (system messages only; no user input |
There was a problem hiding this comment.
the test is relying on a side effect from a call that behaves incorrectly by design, and the phrase "current logic" means this silently changes if the logic changes.
This is a smell that is probably highlighting an conditional management from the underlying logic. Maybe we should fix the user_input reaching openai in the providers module where we do conditionals.
| } | ||
|
|
||
| # First call — warms the cache | ||
| proc1 = LLMProcessor(config=config, user_session=MagicMock(remote_response_id=None)) |
There was a problem hiding this comment.
The file's own docstring says DB rows are used "so that session.save() exercises the actual persistence layer." But the Anthropic cache tests use MagicMock(remote_response_id=None), where save() silently disappears. If Anthropic's code path also calls session.save() (e.g., to update some state), the mock swallows it undetected. The inconsistency is unexplained.
Is there a key reason for this difference? can we have both providers (and the new providers that may come) exercise the exact same code path?
| r2 = proc2.process(context=_LONG_SYSTEM_CONTEXT, input_data="Summarize this in one sentence.") | ||
| assert r2.get("status") == "success", f"Second call failed: {r2}" | ||
|
|
||
| usage = proc2.get_usage() |
There was a problem hiding this comment.
do we really need to run get_usage here? it seems like something that would either requiere its own testing file for different providers or something that we ignore given that is litellm providing it and not something that is operation critical for us.
| assert result2.get("status") == "success", f"Turn 2 failed: {result2}" | ||
| response_text = (result2.get("response") or "").lower() | ||
| assert len(response_text) > 5, "Turn 2 produced an empty response" | ||
| assert "python" in response_text, ( |
There was a problem hiding this comment.
"python" appears in DUMMY_CONTENT. Any call that includes DUMMY_CONTENT as context (including a completely fresh session with no recovery at all) would satisfy assert "python" in response_text.
We could ask something like, who is the creator of the programing language we are discussing and test for "guido".
There was a problem hiding this comment.
I tested this with claude and here is the response:
The assertion would pass even if the recovery code was completely broken and turn 2 had zero awareness of turn 1.
What would actually solve it is planting something in turn 1 that can't be inferred from the context or general knowledge — something the model can only recall if it has access to what was said in turn 1:
▎ Turn 1: "My student ID is ZEPHYR-9142. Just say 'Got it.'"
▎ Turn 2: "What is my student ID?"
▎ Assert "9142" in response
There was a problem hiding this comment.
This would imply that we are not starting fresh but actually loading turns back into the thread from our db
This PR add some missing integration tests for threading