[Fix](Cloud) decouple min pipeline executor size from ConnectContext #60884

Open
CalvinKirs wants to merge 1 commit into apache:master from CalvinKirs:master-batch-modle-cluster-name

@CalvinKirs
Member

CalvinKirs commented Feb 27, 2026

Background

#60648

getMinPipelineExecutorSize in cloud paths implicitly depended on ConnectContext, which caused two issues:

  1. Unstable behavior when no thread-local context is available (e.g. internal/async paths).
  2. Unclear API semantics since callers could not explicitly specify the target cluster.

This PR makes the API explicit by requiring clusterName.
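
The difference between the implicit and explicit lookup styles can be sketched roughly as follows. This is a minimal illustration, not the Doris code: `FakeContext`, `FakeSystemInfo`, the cluster name `cluster_a`, and the default size of 1 are all hypothetical stand-ins chosen for the example.

```java
import java.util.Map;

// Hypothetical stand-in for a thread-local session context (not the real ConnectContext).
final class FakeContext {
    private static final ThreadLocal<FakeContext> CURRENT = new ThreadLocal<>();
    final String clusterName;

    FakeContext(String clusterName) {
        this.clusterName = clusterName;
    }

    static FakeContext get() {
        return CURRENT.get();
    }

    static void set(FakeContext ctx) {
        CURRENT.set(ctx);
    }

    static void clear() {
        CURRENT.remove();
    }
}

// Hypothetical stand-in for the system info service.
final class FakeSystemInfo {
    static final int DEFAULT_SIZE = 1; // assumed default, for illustration only
    private final Map<String, Integer> minExecutorSizeByCluster;

    FakeSystemInfo(Map<String, Integer> sizes) {
        this.minExecutorSizeByCluster = sizes;
    }

    // Implicit style: the cluster comes from thread-local state, so the call
    // throws a NullPointerException on threads where no context was installed.
    int minPipelineExecutorSizeImplicit() {
        String cluster = FakeContext.get().clusterName;
        return minExecutorSizeByCluster.getOrDefault(cluster, DEFAULT_SIZE);
    }

    // Explicit style: the caller names the cluster, so the result does not
    // depend on which thread runs the call.
    int minPipelineExecutorSize(String clusterName) {
        if (clusterName == null || clusterName.isEmpty()) {
            return DEFAULT_SIZE;
        }
        return minExecutorSizeByCluster.getOrDefault(clusterName, DEFAULT_SIZE);
    }
}

final class ExplicitClusterDemo {
    public static void main(String[] args) {
        FakeSystemInfo info = new FakeSystemInfo(Map.of("cluster_a", 8));
        FakeContext.clear(); // simulate an internal/async thread with no context

        System.out.println(info.minPipelineExecutorSize("cluster_a"));
        try {
            info.minPipelineExecutorSizeImplicit();
        } catch (NullPointerException e) {
            System.out.println("implicit lookup failed without a thread-local context");
        }
    }
}
```

In this sketch the explicit variant behaves the same on any thread, while the implicit variant only works where a context has been set first.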

What Changed

  1. Removed the no-arg getMinPipelineExecutorSize() API and kept only: getMinPipelineExecutorSize(String clusterName).
  2. Unified SystemInfoService and CloudSystemInfoService implementations to the string-arg API.
  3. Updated SessionVariable#getParallelExecInstanceNum() to call the string-arg API, with cluster resolved from session/auth information (instead of directly depending on ConnectContext).
  4. Added synchronization in ConnectContext:

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@CalvinKirs
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 29046 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b2f5a0f80828a311a102adccbd75d99cb4b3197d, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17614	4415	4294	4294
q2	
q3	10651	789	576	576
q4	4680	359	289	289
q5	7569	1217	1026	1026
q6	170	175	144	144
q7	807	840	667	667
q8	9307	1471	1336	1336
q9	4854	4784	4749	4749
q10	6828	1879	1645	1645
q11	495	252	244	244
q12	700	633	473	473
q13	17807	4221	3426	3426
q14	235	232	225	225
q15	952	794	785	785
q16	766	729	674	674
q17	726	844	434	434
q18	5929	5343	5302	5302
q19	1576	968	610	610
q20	506	515	384	384
q21	4858	1951	1460	1460
q22	401	317	303	303
Total cold run time: 97431 ms
Total hot run time: 29046 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4665	4548	4572	4548
q2	
q3	1796	2233	1839	1839
q4	854	1191	779	779
q5	4095	4324	4327	4324
q6	186	183	139	139
q7	1746	1602	1496	1496
q8	2479	2744	2557	2557
q9	7302	7289	7228	7228
q10	2620	2910	2496	2496
q11	505	471	424	424
q12	512	599	447	447
q13	4072	4506	3584	3584
q14	283	298	276	276
q15	853	821	794	794
q16	719	745	720	720
q17	1220	1570	1317	1317
q18	7592	6756	6817	6756
q19	903	876	898	876
q20	2078	2184	2043	2043
q21	3970	3477	3386	3386
q22	494	435	411	411
Total cold run time: 48944 ms
Total hot run time: 46440 ms

@doris-robot

TPC-DS: Total hot run time: 183665 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b2f5a0f80828a311a102adccbd75d99cb4b3197d, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4501	623	529	529
query6	331	223	216	216
query7	4211	485	270	270
query8	345	255	244	244
query9	8734	2752	2720	2720
query10	517	386	340	340
query11	16859	17453	17041	17041
query12	200	132	139	132
query13	1423	545	366	366
query14	7415	3399	3118	3118
query14_1	2911	2913	2904	2904
query15	209	202	178	178
query16	1051	502	487	487
query17	1125	779	642	642
query18	2773	478	448	448
query19	272	270	200	200
query20	138	134	132	132
query21	215	140	128	128
query22	5738	5380	4842	4842
query23	17248	16860	16571	16571
query23_1	16577	16809	16713	16713
query24	7137	1616	1217	1217
query24_1	1229	1254	1240	1240
query25	573	489	427	427
query26	1249	311	144	144
query27	2709	484	299	299
query28	4438	1873	1873	1873
query29	799	560	470	470
query30	312	242	208	208
query31	876	735	643	643
query32	81	68	73	68
query33	528	338	285	285
query34	908	908	562	562
query35	639	670	602	602
query36	1096	1117	971	971
query37	137	91	83	83
query38	2988	2895	2851	2851
query39	881	861	852	852
query39_1	809	843	835	835
query40	224	153	133	133
query41	60	59	60	59
query42	106	106	100	100
query43	365	382	351	351
query44	
query45	198	190	182	182
query46	872	973	594	594
query47	2146	2138	2059	2059
query48	307	308	226	226
query49	621	464	390	390
query50	716	274	212	212
query51	4066	4094	4067	4067
query52	106	108	95	95
query53	286	333	291	291
query54	289	291	265	265
query55	92	93	80	80
query56	350	314	312	312
query57	1366	1364	1261	1261
query58	289	281	282	281
query59	2525	2688	2570	2570
query60	341	338	328	328
query61	161	158	147	147
query62	599	581	544	544
query63	314	273	278	273
query64	4889	1288	989	989
query65	
query66	1407	453	354	354
query67	16350	16267	16245	16245
query68	
query69	394	320	288	288
query70	1003	964	877	877
query71	340	313	291	291
query72	2764	2677	2465	2465
query73	532	540	317	317
query74	9964	9900	9751	9751
query75	2833	2774	2469	2469
query76	2306	1015	677	677
query77	356	389	306	306
query78	11381	11491	10722	10722
query79	1140	799	604	604
query80	1368	634	545	545
query81	566	283	252	252
query82	1018	145	117	117
query83	330	274	241	241
query84	247	122	101	101
query85	910	485	438	438
query86	441	305	300	300
query87	3080	3081	3005	3005
query88	3546	2665	2637	2637
query89	423	381	335	335
query90	2042	176	170	170
query91	167	170	132	132
query92	78	73	70	70
query93	983	813	508	508
query94	656	323	309	309
query95	592	349	313	313
query96	657	510	227	227
query97	2521	2475	2407	2407
query98	227	221	217	217
query99	996	986	921	921
Total cold run time: 254427 ms
Total hot run time: 183665 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 62.50% (15/24) 🎉
Increment coverage report
Complete coverage report

Copilot AI left a comment

Pull request overview

This PR refactors the “min pipeline executor size” lookup to no longer implicitly depend on thread-local ConnectContext in cloud mode, making the target cluster explicit and reducing NPE risk in async/internal call paths (as seen in #60648).

Changes:

  • Removes the no-arg getMinPipelineExecutorSize() and standardizes on getMinPipelineExecutorSize(String clusterName).
  • Updates SessionVariable#getParallelExecInstanceNum() to use per-session user/cluster resolution instead of directly relying on ConnectContext.
  • Adjusts cloud/non-cloud implementations and updates/extends related unit tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/system/SystemInfoService.java Changes API to string-arg method (non-cloud implementation still scans all backends).
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java Removes ConnectContext dependency; returns default when cluster name is empty.
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java Introduces transient qualifiedUser and explicit cloud-cluster resolution for auto parallelism.
fe/fe-core/src/main/java/org/apache/doris/qe/ConnectContext.java Propagates qualified user and selected cloud cluster into SessionVariable.
fe/fe-core/src/test/java/org/apache/doris/system/SystemInfoServiceTest.java Updates calls to use the new string-arg API.
fe/fe-core/src/test/java/org/apache/doris/cloud/system/CloudSystemInfoServiceTest.java Adapts tests to the new API and adds explicit-cluster test coverage.
Comments suppressed due to low confidence (1)

fe/fe-core/src/main/java/org/apache/doris/system/SystemInfoService.java:1086

  • The new clusterName parameter is currently unused in the non-cloud SystemInfoService implementation (it always scans all backends via getAllBackendsByAllCluster()). To avoid misleading API semantics, add Javadoc clarifying that the parameter is ignored in non-cloud mode (kept only for CloudSystemInfoService override) or consider an overload/renaming that makes this explicit.
    public int getMinPipelineExecutorSize(String clusterName) {
        List<Backend> currentBackends = null;
        try {
            currentBackends = getAllBackendsByAllCluster().values().asList();
        } catch (UserException e) {

Comment on lines +1026 to +1031
    try {
        clusterName = context.getCloudCluster(false);
    } catch (Exception e) {
        return 1;
    }
    return infoService.getMinPipelineExecutorSize(clusterName);
Copilot AI Feb 28, 2026

Helper getMinPipelineExecutorSizeByContext() catches a broad Exception and returns 1, which can mask unexpected failures (e.g., NPEs) in tests that assert the result is 1. Prefer catching the expected ComputeGroupException (or the specific checked exception thrown by getCloudCluster(false)) and let other exceptions fail the test.
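
A rough sketch of the difference between the broad and the narrow catch is shown below. All names here are hypothetical: `ClusterResolveException` stands in for whatever checked exception `getCloudCluster(false)` actually throws, and `ClusterSource` stands in for the context object.

```java
import java.util.function.ToIntFunction;

// Hypothetical checked exception, standing in for the one thrown by getCloudCluster(false).
class ClusterResolveException extends Exception {
    ClusterResolveException(String message) {
        super(message);
    }
}

// Hypothetical abstraction over "ask the context for its cloud cluster".
interface ClusterSource {
    String cloudCluster() throws ClusterResolveException;
}

final class NarrowCatchDemo {
    static final int FALLBACK = 1;

    // Broad catch: an unexpected NullPointerException inside cloudCluster()
    // is swallowed and looks exactly like the expected "no cluster" case.
    static int sizeBroad(ClusterSource source, ToIntFunction<String> lookup) {
        String clusterName;
        try {
            clusterName = source.cloudCluster();
        } catch (Exception e) {
            return FALLBACK;
        }
        return lookup.applyAsInt(clusterName);
    }

    // Narrow catch: only the expected failure maps to the fallback;
    // anything else propagates and fails the test.
    static int sizeNarrow(ClusterSource source, ToIntFunction<String> lookup) {
        String clusterName;
        try {
            clusterName = source.cloudCluster();
        } catch (ClusterResolveException e) {
            return FALLBACK;
        }
        return lookup.applyAsInt(clusterName);
    }
}
```

With a source that throws `NullPointerException`, `sizeBroad` still returns 1, so a test asserting `1` would pass for the wrong reason, whereas `sizeNarrow` surfaces the bug.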

Comment on lines 4182 to +4218
 public int getParallelExecInstanceNum() {
-    ConnectContext connectContext = ConnectContext.get();
-    if (connectContext != null && connectContext.getEnv() != null && connectContext.getEnv().getAuth() != null) {
-        int userParallelExecInstanceNum = connectContext.getEnv().getAuth()
-                .getParallelFragmentExecInstanceNum(connectContext.getQualifiedUser());
+    Env currentEnv = Env.getCurrentEnv();
+    if (!Strings.isNullOrEmpty(qualifiedUser)
+            && currentEnv != null
+            && currentEnv.getAuth() != null) {
+        int userParallelExecInstanceNum = currentEnv.getAuth()
+                .getParallelFragmentExecInstanceNum(qualifiedUser);
         if (userParallelExecInstanceNum > 0) {
             return userParallelExecInstanceNum;
         }
     }
     if (parallelPipelineTaskNum == 0) {
-        int size = Env.getCurrentSystemInfo().getMinPipelineExecutorSize();
+        int size = Env.getCurrentSystemInfo().getMinPipelineExecutorSize(resolveCloudClusterForAutoParallel());
         int autoInstance = (size + 1) / 2;
         return Math.min(autoInstance, maxInstanceNum);
     } else {
         return parallelPipelineTaskNum;
     }
 }
+
+public void setQualifiedUser(String qualifiedUser) {
+    this.qualifiedUser = qualifiedUser;
+}
+
+private String resolveCloudClusterForAutoParallel() {
+    if (!Config.isCloudMode()) {
+        return "";
+    }
+    if (!Strings.isNullOrEmpty(cloudCluster)) {
+        return cloudCluster;
+    }
+    Env currentEnv = Env.getCurrentEnv();
+    if (currentEnv == null || currentEnv.getAuth() == null || Strings.isNullOrEmpty(qualifiedUser)) {
+        return "";
+    }
+    return Strings.nullToEmpty(currentEnv.getAuth().getDefaultCloudCluster(qualifiedUser));
+}
Copilot AI Feb 28, 2026

getParallelExecInstanceNum() now depends on the new per-session transient qualifiedUser plus resolveCloudClusterForAutoParallel() logic, including behavior when ConnectContext thread-local is absent. There are no existing unit tests covering this method, so regressions here (auth override precedence, default-cluster resolution, empty-cluster fallback) are currently untested; please add focused tests for these new branches.
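
The branches named above could be exercised against a small model like the one below. This is only a sketch of the decision order visible in the diff: `ParallelismModel` is a hypothetical class, the auth lookups are replaced by plain fields, the backend lookup is passed in as a function, and the qualified-user null checks are omitted for brevity.

```java
import java.util.function.ToIntFunction;

// Hypothetical model of the decision order in getParallelExecInstanceNum();
// field names follow the diff, everything else is simplified for illustration.
final class ParallelismModel {
    int userParallelExecInstanceNum; // per-user override from auth, 0 means unset
    int parallelPipelineTaskNum;     // session setting, 0 means "auto"
    int maxInstanceNum;              // upper bound for the auto value
    boolean cloudMode;
    String cloudCluster;             // cluster selected in the session, may be null or empty
    String defaultCloudCluster;      // user's default cluster from auth, may be null

    // Mirrors resolveCloudClusterForAutoParallel(): session cluster first,
    // then the user's default cluster, else the empty string.
    String resolveCluster() {
        if (!cloudMode) {
            return "";
        }
        if (cloudCluster != null && !cloudCluster.isEmpty()) {
            return cloudCluster;
        }
        return defaultCloudCluster == null ? "" : defaultCloudCluster;
    }

    int parallelExecInstanceNum(ToIntFunction<String> minExecutorSize) {
        if (userParallelExecInstanceNum > 0) {
            return userParallelExecInstanceNum; // auth override wins
        }
        if (parallelPipelineTaskNum == 0) {
            int size = minExecutorSize.applyAsInt(resolveCluster());
            int autoInstance = (size + 1) / 2;  // half the executor size, rounded up
            return Math.min(autoInstance, maxInstanceNum);
        }
        return parallelPipelineTaskNum;
    }
}

final class ParallelismDemo {
    public static void main(String[] args) {
        ParallelismModel model = new ParallelismModel();
        model.cloudMode = true;
        model.defaultCloudCluster = "cluster_b"; // hypothetical cluster name
        model.maxInstanceNum = 64;
        // Hypothetical lookup: cluster_b has a min executor size of 7.
        ToIntFunction<String> lookup = c -> c.equals("cluster_b") ? 7 : 1;
        System.out.println(model.parallelExecInstanceNum(lookup)); // (7 + 1) / 2 = 4
    }
}
```

Focused unit tests would then set one field at a time and assert the precedence: auth override first, explicit `parallelPipelineTaskNum` next, and the auto value capped by `maxInstanceNum` last.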

Comment on lines 651 to 666
@@ -658,6 +661,9 @@ public SessionVariable getSessionVariable() {

     public void setSessionVariable(SessionVariable sessionVariable) {
         this.sessionVariable = sessionVariable;
+        if (this.sessionVariable != null) {
+            this.sessionVariable.setQualifiedUser(getQualifiedUser());
+        }
Copilot AI Feb 28, 2026

PR description mentions "Added synchronization in ConnectContext", but the changes here only propagate qualifiedUser/cloudCluster into SessionVariable and do not introduce any synchronization. If synchronization is still required for correctness, it appears missing; otherwise please update the PR description to avoid confusion.
