refactor: Simplify query executor interface #1015

TrevorBergeron · 2024-09-24T01:04:40Z

tswast

Nice refactor! A few nits to resolve before merging, please.

Long-term, I'd like to see what we can do to consolidate some of the BQ -> pandas DataFrame logic into pandas-gbq. Since BigFrames has made slightly different choices with regard to returned pandas dtypes, though, I think it makes sense to retain this logic for now.

tswast · 2024-09-24T18:34:01Z

bigframes/core/blocks.py

@@ -487,9 +460,8 @@ def to_arrow(
            list(self.value_columns) + list(self.index_columns)
        )

-        _, query_job = self.session._execute(expr, ordered=ordered)
-        results_iterator = query_job.result()
-        pa_table = results_iterator.to_arrow()


FYI: to_arrow() automatically uses the BQ Storage Read API in some cases. See https://github.com/googleapis/python-bigquery/blob/ba99b12215995448998fccb6691423f4555a73bf/google/cloud/bigquery/table.py#L1699-L1736 for the existing logic that chooses between BQ Storage and the REST API.

A benefit of doing it ourselves, though, is that we can avoid creating a BQ Storage client. That said, creating one should be pretty cheap given that auth has already happened, and we can also avoid it by passing an existing BQ Storage client to to_arrow().

It seems that we should always provide the read client to avoid repeated client creation overhead, but otherwise let the client library decide when to use the Read API. The BQ Python client is more likely to maintain good logic for choosing the read path than we are.

Switched to always providing the client. It seems that to_arrow() would create its own client if needed, whereas to_arrow_iterable() only uses a provided client, while still using the old read path where it's more efficient.
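
For readers following along, here is a minimal sketch of the "always provide the client" approach. It assumes the `bqstorage_client` keyword that `RowIterator.to_arrow_iterable()` accepts in google-cloud-bigquery. The helper name `iter_arrow_batches` is hypothetical and is not the code from this PR; the row iterator is duck-typed so the snippet runs without BigQuery credentials.

```python
from typing import Any, Iterator


def iter_arrow_batches(row_iterator: Any, bqstorage_client: Any) -> Iterator[Any]:
    """Stream Arrow record batches, reusing an already-created read client.

    Hypothetical helper: `row_iterator` is expected to behave like
    google.cloud.bigquery.table.RowIterator, and `bqstorage_client` like a
    BigQuery Storage read client created once per session.
    """
    # Per the discussion above, the client library still picks the read path;
    # this only avoids constructing a new storage client on every call.
    yield from row_iterator.to_arrow_iterable(bqstorage_client=bqstorage_client)
```

The design choice is to keep the path-selection heuristics in the BigQuery client library and limit the caller's responsibility to client reuse.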

tswast · 2024-09-24T18:42:48Z

bigframes/core/blocks.py

        )
+        assert execute_result.total_bytes is not None


Might this assertion be false with an empty DataFrame (i.e. no rows)?

We have a few tests that produce empty results. I stepped through with a debugger and confirmed that total_rows=0 and total_bytes=0.
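
To illustrate why the assertion holds for empty results: `is not None` only rejects a missing value, so a legitimate `0` passes. This is a minimal sketch with a hypothetical stand-in for `ExecuteResult`; the real class in this PR may carry different fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeExecuteResult:
    """Hypothetical stand-in holding only the two fields discussed above."""

    total_rows: Optional[int] = None
    total_bytes: Optional[int] = None


def require_total_bytes(result: FakeExecuteResult) -> int:
    # `is not None` separates "size unknown" from "zero bytes".
    # A bare `assert result.total_bytes` would fail on an empty result.
    assert result.total_bytes is not None
    return result.total_bytes
```

An empty result (`total_rows=0`, `total_bytes=0`) passes the check, while a result with no byte count raises `AssertionError`.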

tswast · 2024-09-24T18:51:06Z

bigframes/core/blocks.py

        )
+        assert execute_result.total_bytes is not None
+        table_size = execute_result.total_bytes / _BYTES_TO_MEGABYTES


table_mb maybe to reflect the units?

P.S. I'm trying to find some link to this convention. https://google.aip.dev/142#compatibility is somewhat relevant, though date/time specific.

Might have been this?

It's often a good idea to include units in a name. Using variable names like "distanceToTargetInCentimeters" vs. "distanceToTargetInInches" can avoid confusion. Some people would suggest using an underscore instead of the word "In" in such names. It acts as punctuation, separating the purpose of the variable from its units: "distanceToTarget_centimeters". It's Hungarian at a more abstract level (and with meaningful words instead of cryptic letters). -- DaveWhipp

https://wiki.c2.com/?HungarianNotation

Changed to table_mb.
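
A small sketch of the renamed variable, for illustration. The value of `_BYTES_TO_MEGABYTES` is an assumption here (1024 * 1024), since the diff only shows that the constant is used, and `table_size_mb` is a hypothetical helper name.

```python
# Assumed value; the diff above does not show the constant's definition.
_BYTES_TO_MEGABYTES = 1024 * 1024


def table_size_mb(total_bytes: int) -> float:
    """Convert a byte count to megabytes (hypothetical helper)."""
    # Carrying the unit in the name makes later comparisons against a
    # megabyte threshold harder to get wrong.
    table_mb = total_bytes / _BYTES_TO_MEGABYTES
    return table_mb
```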

tswast · 2024-09-24T20:40:08Z

bigframes/session/executor.py

@@ -41,10 +48,20 @@
 QUERY_COMPLEXITY_LIMIT = 1e7
 # Number of times to factor out subqueries before giving up.
 MAX_SUBTREE_FACTORINGS = 5
-
+# A bytes limit would probably be more appropriate,
+_READ_API_MIN_ROWS = 1000


This constant is no longer used, right?

TrevorBergeron requested review from a team as code owners September 24, 2024 01:04

TrevorBergeron requested a review from sycai September 24, 2024 01:04

product-auto-label bot added the size: l Pull request size is large. label Sep 24, 2024

blunderbuss-gcf bot assigned jiaxunwu Sep 24, 2024

product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Sep 24, 2024

TrevorBergeron force-pushed the session_simplify branch from 407f9cb to 248d063 Compare September 24, 2024 01:34

TrevorBergeron requested a review from tswast September 24, 2024 01:34

perf: Use read api for to_pandas when >1000 results

13e8f0a

TrevorBergeron force-pushed the session_simplify branch from 248d063 to 13e8f0a Compare September 24, 2024 15:45

fix issue with executor head return type

b396c75

tswast approved these changes Sep 24, 2024

TrevorBergeron added 2 commits September 24, 2024 19:58

always provide read client

d22bbe8

fix test param making network calls

716418a

TrevorBergeron requested a review from tswast September 24, 2024 20:11

TrevorBergeron changed the title ~~perf: Use read api for to_pandas when >1000 results~~ refactor: Simplify query executor interface Sep 24, 2024

tswast reviewed Sep 24, 2024

TrevorBergeron added 2 commits September 25, 2024 01:18

fix empty batch iterator case

f5d5414

Merge remote-tracking branch 'github/main' into session_simplify

92001c5

TrevorBergeron force-pushed the session_simplify branch from a20479f to 92001c5 Compare September 25, 2024 01:18

TrevorBergeron enabled auto-merge (squash) September 25, 2024 01:19

Make ExecuteResult reusable and store schema

b9fedb9

TrevorBergeron disabled auto-merge September 25, 2024 05:41

TrevorBergeron added 3 commits September 25, 2024 05:42

fix old arrow_batches references

5b690f3

fix array field nullability

16d09cf

fix cutop field def

6802c70

TrevorBergeron requested a review from tswast September 25, 2024 18:31

fix test expectation nullable list

ae7b5a0

TrevorBergeron merged commit c89e92e into main Sep 26, 2024
22 of 23 checks passed

TrevorBergeron deleted the session_simplify branch September 26, 2024 21:26
