refactor: Simplify query executor interface #1015
Nice refactor! A few nits to resolve before merging, please.
Long-term, I'd like to see what we can do to consolidate some of the BQ -> pandas DataFrame logic into pandas-gbq, but since BigFrames has made slightly different choices with regard to returned pandas dtypes, I think it makes sense to retain this logic for now.
@@ -487,9 +460,8 @@ def to_arrow(
list(self.value_columns) + list(self.index_columns)
)

_, query_job = self.session._execute(expr, ordered=ordered)
results_iterator = query_job.result()
pa_table = results_iterator.to_arrow()
FYI: to_arrow() automatically uses the BQ Storage Read API in some cases. See https://github.com/googleapis/python-bigquery/blob/ba99b12215995448998fccb6691423f4555a73bf/google/cloud/bigquery/table.py#L1699-L1736 for the existing logic that chooses between the BQ Storage API and the REST API.
A benefit of doing it ourselves, though, is that we can avoid creating a BQ Storage client (though that should be a pretty cheap operation, given that auth has already happened, and we can also avoid it by passing in a BQ Storage client to to_arrow()).
It seems that we should always provide the read client to avoid repeated client creation overhead, but otherwise just let the client library decide when to use the Read API. The BQ Python client seems more likely to maintain good logic for choosing the read path than we are.
Switched to just always providing the client. It seems that to_arrow() would generate its own client if needed, but to_arrow_iterable() will only use a provided client, while still using the old read path where it's more efficient.
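For readers following the thread, the "always provide the client" approach might look roughly like the sketch below. This is an illustration only, not the code merged in this PR: the ArrowFetcher class and its method name are hypothetical. The one library detail relied on is that the row iterator's to_arrow() accepts a bqstorage_client keyword argument.

```python
class ArrowFetcher:
    """Hypothetical helper: holds one BQ Storage read client and reuses it."""

    def __init__(self, bqstorage_client):
        # Created once (for example, alongside the session) so that no
        # download has to construct a new read client.
        self._bqstorage_client = bqstorage_client

    def fetch(self, row_iterator):
        # Always hand over the shared client. The BigQuery client library
        # still decides whether the Read API or the REST path is used.
        return row_iterator.to_arrow(bqstorage_client=self._bqstorage_client)
```

Here row_iterator stands for the object returned by query_job.result(), as in the diff above.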
)
assert execute_result.total_bytes is not None
Might this assertion be false with an empty DataFrame (i.e. no rows)?
We have a few tests that produce empty results. I stepped through with a debugger and confirmed that total_rows=0 and total_bytes=0 in that case, so the assertion holds.
bigframes/core/blocks.py
Outdated
)
assert execute_result.total_bytes is not None
table_size = execute_result.total_bytes / _BYTES_TO_MEGABYTES
Maybe table_mb, to reflect the units?
P.S. I'm trying to find some link to this convention. https://google.aip.dev/142#compatibility is somewhat relevant, though date/time specific.
Might have been this?
Its often a good idea to include units in a name. Using variable names like "distanceToTargetInCentimeters" Vs "distanceToTargetInInches" can avoid confusion. Some people would suggest using an underscore instead of the word "In" in such names. It acts as punctuation, separating the purpose of the variable from its units: "distanceToTarget_centimeters". It's Hungarian at a more abstract level (and with meaningful words instead of cryptic letters). -- DaveWhipp
changed to table_mb
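A small sketch of the two points settled above: an empty result reports total_bytes=0 rather than None, and the variable name carries its unit. The helper name result_size_mb is hypothetical, and the value of _BYTES_TO_MEGABYTES is an assumption, since the diff never shows its definition.

```python
from typing import Optional

# Assumed value: the thread never shows how _BYTES_TO_MEGABYTES is defined.
_BYTES_TO_MEGABYTES = 1024 * 1024


def result_size_mb(total_bytes: Optional[int]) -> float:
    """Convert a byte count to megabytes; 0 is valid, None is not."""
    if total_bytes is None:
        raise ValueError("total_bytes was not reported for this result")
    return total_bytes / _BYTES_TO_MEGABYTES


# An empty result is simply 0.0 MB, not an error.
table_mb = result_size_mb(0)
```

Checking for None explicitly, instead of relying on truthiness, is what keeps the zero-byte case from being mistaken for a missing value.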
bigframes/session/executor.py
Outdated
@@ -41,10 +48,20 @@
QUERY_COMPLEXITY_LIMIT = 1e7
# Number of times to factor out subqueries before giving up.
MAX_SUBTREE_FACTORINGS = 5

# A bytes limit would probably be more appropriate,
_READ_API_MIN_ROWS = 1000
This constant is no longer used, right?
removed