feat(python/sedonadb): Ensure dataframes have meaningful default alias#935

Closed
paleolimbot wants to merge 7 commits into apache:main from paleolimbot:python-df-edges

@paleolimbot

Member

This PR gives every dataframe a default alias at creation time that is probably unique and mostly meaningful (e.g., derived from the filename being read), so that joins don't require manual aliasing in most cases.

Motivating use case:

import sedona.db

sd = sedona.db.connect()

pts = sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.parquet"
)
ply = sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-poly.parquet"
)
pts.join(ply, on=sd.funcs.st_intersects(pts.geometry, ply.geometry)).select(
    pt_hid=pts.HID, pl_hid=ply.HID, geometry=pts.geometry
).show(5)
# ┌────────────────────────────────┬────────────────────────────────┬────────────────────────────────┐
# │             pt_hid             ┆             pl_hid             ┆            geometry            │
# │              utf8              ┆              utf8              ┆            geometry            │
# ╞════════════════════════════════╪════════════════════════════════╪════════════════════════════════╡
# │ FC9A29123BEF4A6588A2FD777B81A… ┆ 63E9CC5AE7144EEABEBEC2C95C91F… ┆ POINT Z(258498.62729999982 48… │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 28E47E22D71549ED91472AD59A9D8… ┆ 63E9CC5AE7144EEABEBEC2C95C91F… ┆ POINT Z(258498.32830000017 48… │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ B10B26FA32FB4438925D86CD612BA… ┆ 63E9CC5AE7144EEABEBEC2C95C91F… ┆ POINT Z(258502.0273000002 481… │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 6ACD71128B6B492C8177B11A5C2D2… ┆ 63E9CC5AE7144EEABEBEC2C95C91F… ┆ POINT Z(258498.92729999963 48… │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 0A4BE2AB03D8459B8279FCC4BDE55… ┆ 63E9CC5AE7144EEABEBEC2C95C91F… ┆ POINT Z(258526.62729999982 48… │
# └────────────────────────────────┴────────────────────────────────┴────────────────────────────────┘

A downside here is that the explain plans get more verbose because they always print out the qualifier.

Closes #926.

@github-actions github-actions Bot requested a review from prantogg June 9, 2026 20:51

Copilot AI left a comment

Contributor


Pull request overview

This PR adds automatic, meaningful default aliases to Python DataFrames at creation time (especially for file-based reads), reducing ambiguous column reference errors in joins without requiring users to manually call .alias() in common cases.

Changes:

  • Add _ensure_aliased() + default alias generation to automatically qualify DataFrame columns on creation.
  • Apply default aliasing to Context.read_parquet(), Context.read_pyogrio(), Context.read_format(), and Context.sql().
  • Add/adjust tests to assert alias behavior and relax a brittle explain-plan output assertion.
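
To make the idea of default alias generation concrete, here is a minimal sketch of how such an alias could be derived from a path or URL. The helper name `default_alias` and its exact rules are hypothetical and are not taken from the PR's `_ensure_aliased()`; the `id()`-based suffix only mirrors the approach discussed later in this thread.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def default_alias(source: str, obj: object) -> str:
    """Hypothetical helper: build an alias such as 'ns_water_water_point_1a2b3c4d'."""
    # Take the last component of a POSIX-style path or URL, minus its extension
    stem = PurePosixPath(urlparse(source).path).stem
    # Replace characters that are awkward in an unquoted identifier
    stem = re.sub(r"[^0-9A-Za-z_]+", "_", stem).strip("_") or "df"
    # Suffix from id(): usually distinct among live objects, not stable across runs
    suffix = format(id(obj) & 0xFFFFFFFF, "08x")
    return f"{stem}_{suffix}"


url = (
    "https://github.com/geoarrow/geoarrow-data/releases/download/"
    "v0.2.0/ns-water_water-point.parquet"
)
print(default_alias(url, object()))
```

The readable prefix keeps explain plans and error messages recognizable, while the suffix is what makes two reads of the same file distinguishable from each other.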

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| python/sedonadb/python/sedonadb/dataframe.py | Introduces _ensure_aliased() and default-alias generation for newly created DataFrames. |
| python/sedonadb/python/sedonadb/context.py | Applies automatic aliasing to read helpers and sql(). |
| python/sedonadb/src/dataframe.rs | Special-cases aliasing for EXPLAIN logical plans to keep sd.sql("EXPLAIN ...") executable. |
| python/sedonadb/tests/test_dataframe.py | Updates tests around DataFrame creation from DataFrame and relaxes a brittle explain assertion. |
| python/sedonadb/tests/test_context.py | Removes the old test_read_parquet (now covered under io tests). |
| python/sedonadb/tests/io/test_parquet.py | Adds read_parquet alias assertions and keeps existing parquet behavior checks. |
| python/sedonadb/tests/io/test_pyogrio.py | Adds an alias assertion when reading a path via pyogrio. |

Comment thread python/sedonadb/python/sedonadb/context.py
Comment thread python/sedonadb/python/sedonadb/context.py
Comment thread python/sedonadb/python/sedonadb/context.py
Comment thread python/sedonadb/python/sedonadb/dataframe.py Outdated

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread python/sedonadb/python/sedonadb/dataframe.py
@paleolimbot paleolimbot marked this pull request as ready for review June 11, 2026 02:44
@paleolimbot

Member Author

@jiayuasu Do you have bandwidth to review this one since it's mostly for joins?

@jiayuasu

jiayuasu commented Jun 13, 2026

Member

Two things give me pause about this mechanism:

  1. It doesn't actually solve self-joins. Because the alias is baked from
    the source object at create time, a self-join gets the same qualifier on
    both sides. I tried the predicate path from your example:
  df = sd.create_data_frame(...).alias("foo_deadbeef")   # one default alias
  df.join(df, on=df["k"] == df["k"])
  2. The id()-based suffix is non-deterministic: the qualifier changes every run.
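
The self-join concern can be illustrated with a toy model of qualifier-based column resolution. This is only a sketch: the `resolve` function is hypothetical and is not SedonaDB's actual planner, but it shows why a single alias shared by both sides of a join cannot disambiguate a column reference.

```python
def resolve(qualified_name: str, sides: list[tuple[str, list[str]]]) -> int:
    """Return the index of the join side that owns 'qualifier.column'.

    Toy resolver: `sides` is a list of (qualifier, column names) pairs.
    """
    qualifier, column = qualified_name.split(".", 1)
    matches = [
        i for i, (q, cols) in enumerate(sides) if q == qualifier and column in cols
    ]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or unknown reference: {qualified_name}")
    return matches[0]


# Two different sources get different qualifiers, so the reference resolves
print(resolve("ply.HID", [("pts", ["HID"]), ("ply", ["HID"])]))

# A self-join reuses one qualifier on both sides, so the reference is ambiguous
try:
    resolve("foo_deadbeef.k", [("foo_deadbeef", ["k"]), ("foo_deadbeef", ["k"])])
except ValueError as err:
    print(err)
```

In this model, a self-join only resolves if at least one side is given a different qualifier, which is what manual aliasing provides.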

@paleolimbot

Member Author

Got it...we can manually alias and/or add spatial join helpers later


Development

Successfully merging this pull request may close these issues.

feat(python/sedonadb): Automatically give nameless dataframes an alias on create