Skip to content

feat: Experimental transpilation of unannotated python callables#17419

Open
TrevorBergeron wants to merge 8 commits into
mainfrom
tbergeron_transpile_path
Open

feat: Experimental transpilation of unannotated python callables#17419
TrevorBergeron wants to merge 8 commits into
mainfrom
tbergeron_transpile_path

Conversation

@TrevorBergeron

Copy link
Copy Markdown
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an experimental Python transpiler feature, allowing standard Python callables to be transpiled and executed within DataFrame.map, DataFrame.apply, Series.apply, and Series.combine. It replaces the func_to_op helper with func_to_expr, which generates a new CallableExpression to manage argument binding. Feedback on the changes highlights critical validation gaps in both DataFrame.apply and CallableExpression.apply where missing checks for unexpected keyword arguments, duplicate arguments, or extra positional arguments could lead to silent correctness bugs or cryptic errors.

Comment thread packages/bigframes/bigframes/dataframe.py Outdated
Comment thread packages/bigframes/bigframes/operations/to_op.py Outdated
@TrevorBergeron TrevorBergeron requested a review from tswast June 16, 2026 00:22
@TrevorBergeron TrevorBergeron marked this pull request as ready for review June 16, 2026 00:22
@TrevorBergeron TrevorBergeron requested review from a team as code owners June 16, 2026 00:22
@tswast

tswast commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

System test looks like it might indicate a real error, but probably unrelated to this change:

___________________ test_read_parquet_gcs[bigquery_wildcard] ___________________
[gw19] linux -- Python 3.12.12 /tmpfs/src/github/google-cloud-python/packages/bigframes/.nox/system/bin/python

session = 
scalars_dfs = (          bool_col                                          bytes_col  \
rowindex                                    ...7             True  ...  0 days 00:00:00.000004
8            False  ...         5 days 00:00:00

[9 rows x 14 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20260616000152_914011/'
engine = 'bigquery', filename = '*.parquet'

    @pytest.mark.parametrize(
        ("engine", "filename"),
        (
            pytest.param(
                "auto",
                "000000000000.parquet",
                id="auto",
            ),
            pytest.param(
                "pyarrow",
                "000000000000.parquet",
                id="pyarrow",
            ),
            pytest.param(
                "bigquery",
                "000000000000.parquet",
                id="bigquery",
            ),
            pytest.param(
                "bigquery",
                "*.parquet",
                id="bigquery_wildcard",
            ),
            pytest.param(
                "auto",
                "*.parquet",
                id="auto_wildcard",
                marks=pytest.mark.xfail(
                    raises=ValueError,
                ),
            ),
        ),
    )
    def test_read_parquet_gcs(
        session: bigframes.Session, scalars_dfs, gcs_folder, engine, filename
    ):
        scalars_df, _ = scalars_dfs
        # Include wildcard so that multiple files can be written/read if > 1 GB.
        # https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
        write_path = gcs_folder + test_read_parquet_gcs.__name__ + "*.parquet"
        read_path = gcs_folder + test_read_parquet_gcs.__name__ + filename
    
        df_in: bigframes.dataframe.DataFrame = scalars_df.copy()
        # GEOGRAPHY not supported in parquet export.
        df_in = df_in.drop(columns="geography_col")
        # Make sure we can also serialize the order.
        df_write = df_in.reset_index(drop=False)
        df_write.index.name = f"ordering_id_{random.randrange(1_000_000)}"
        df_write.to_parquet(write_path, index=True)
    
        df_out = (
            session.read_parquet(read_path, engine=engine)
            # Restore order.
>           .set_index(df_write.index.name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            .sort_index()
            # Restore index.
            .set_index(typing.cast(str, df_in.index.name))
        )

tests/system/small/test_session.py:1916: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
bigframes/core/logging/log_adapter.py:183: in wrapper
    return method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self =     ordering_id_430420  rowindex  bool_col  \
0                          4     False   
2                     ...                  432000000000  
10                                  432000000000  
...

[36 rows x 15 columns]
keys = ('ordering_id_463089',), append = False, drop = True

    def set_index(
        self,
        keys: typing.Union[blocks.Label, typing.Sequence[blocks.Label]],
        append: bool = False,
        drop: bool = True,
    ) -> DataFrame:
        if not utils.is_list_like(keys):
            keys = typing.cast(typing.Sequence[blocks.Label], (keys,))
        else:
            keys = typing.cast(typing.Sequence[blocks.Label], tuple(keys))
        col_ids = [self._resolve_label_exact(key) for key in keys]
        missing = [keys[i] for i in range(len(col_ids)) if col_ids[i] is None]
        if len(missing) > 0:
>           raise KeyError(f"None of {missing} are in the columns")
E           KeyError: "None of ['ordering_id_463089'] are in the columns"

bigframes/dataframe.py:2419: KeyError

CC oncall @sycai for visibility

@tswast tswast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! A few comments.

"Python transpiler is an unstable, experimental feature, and not yet fully "
"validated, use at your own risk."
)
warnings.warn(msg, category=bfe.PreviewWarning)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I like to make custom exceptions the subclass from PreviewWarning for more explicit opt-in, but probably overkill in retrospect.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


name: str
default_value: typing.Any
is_varargs: bool

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to track keyword-only and/or kwargs dictionary separately? Or maybe that's not really inferrable from the Python AST?

)
)

from bigframes.core.bytecode import dis_to_expr

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"dis" is hard for me to understand without context. Could we use a more descriptive name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to py_to_expression

bindings: typing.Mapping[ids.ColumnId, ex.Expression],
allow_partial_bindings: bool = False,
) -> CallableExpression:
return dataclasses.replace(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These bind and transform functions are pretty similar to the other implementations, right? Maybe we can do some sort of mixin class to implement these?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK about specific impl (mixin or otherwise), but yeah, not that have expanded from 3 expression subclasses (op, literal, reference) to several more, things are getting redundant.

The logic is actually a bit custom per expression type, but it could maybe just be derived from dataclass field metadata, as most of it amounts to descending through fields (known statically) that subclass expression and returning/replacing them.

Created internal issue 525095335 to track a refactor here.

Comment on lines +4752 to +4754
for col in self.columns:
if col in expr.free_variables:
col_id = block.resolve_label_exact(col)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we expect any trouble from mixing the high-level (dataframe columns) representation and lower level (block column labels) representation?

Also, this looks relatively familiar, such as in our bbq.sql_scalar implementation. Perhaps there's some shared utilities we can refactor for this column mapping logic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to cross that label to id boundary. The python function references series attributes which are labels, but we want to map those to unambiguous block ids.

The implementation here is a bit messy though, refactored it in new revision as apply_to_block_rows

@@ -4692,13 +4692,17 @@ def _prepare_export(
return array_value, id_overrides

def map(self, func, na_action: Optional[str] = None) -> DataFrame:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see an example in the docstrings, especially if this is compatible with the polars engine and thus would make for a relatively speedy flakeless doctest.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added doctests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants