refactor: Simplify query executor interface #1015

Merged: 11 commits merged on Sep 26, 2024

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners September 24, 2024 01:04
@product-auto-label product-auto-label bot added the "size: l" label (pull request size is large) Sep 24, 2024
@product-auto-label product-auto-label bot added the "api: bigquery" label (issues related to the googleapis/python-bigquery-dataframes API) Sep 24, 2024
Collaborator

@tswast tswast left a comment

Nice refactor! A few nits to resolve before merging, please.

Long-term, I'd like to see what we can do to consolidate some BQ -> pandas DataFrame logic to pandas-gbq, but since BigFrames has made slightly different choices with regards to returned pandas dtypes, I think this logic makes sense to retain right now.

@@ -487,9 +460,8 @@ def to_arrow(
list(self.value_columns) + list(self.index_columns)
)

_, query_job = self.session._execute(expr, ordered=ordered)
results_iterator = query_job.result()
pa_table = results_iterator.to_arrow()
Collaborator

FYI: to_arrow() already uses the BQ Storage Read API automatically in some cases. See https://github.com/googleapis/python-bigquery/blob/ba99b12215995448998fccb6691423f4555a73bf/google/cloud/bigquery/table.py#L1699-L1736 for the existing logic that chooses between the BQ Storage API and the REST API.

A benefit of doing it ourselves though is that we can avoid creating a BQ Storage client (though that should be a pretty cheap operation, given that auth has already happened, and we can avoid it by passing in a BQ Storage client to to_arrow().)

Contributor Author

It seems that we should always provide the read client to avoid repeated client-creation overhead, but otherwise just let the client library decide when to use the Read API? The BQ Python client seems more likely to maintain good logic for choosing the read path than we are.

Contributor Author

Switched to always providing the client. It seems that to_arrow() would create its own client if needed, whereas to_arrow_iterable() only uses a provided client, while still using the old read path where it's more efficient.
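A minimal sketch of the "always provide the client" approach described above, not the code from this PR. It assumes `row_iterator` behaves like google-cloud-bigquery's `RowIterator`, whose `to_arrow_iterable()` accepts a `bqstorage_client` keyword. The helper name `iter_arrow_batches` and the `FakeRowIterator` stand-in are hypothetical and exist only so the example runs without a BigQuery connection.

```python
from typing import Any, Iterator, Optional


def iter_arrow_batches(row_iterator: Any, bqstorage_client: Optional[Any]) -> Iterator[Any]:
    """Yield Arrow record batches, always forwarding the shared read client.

    The BigQuery client library still decides whether to use the Read API
    or the REST path; this wrapper only makes sure no new client is built.
    """
    yield from row_iterator.to_arrow_iterable(bqstorage_client=bqstorage_client)


class FakeRowIterator:
    """Hypothetical stand-in for RowIterator so the sketch runs offline."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.seen_client = None

    def to_arrow_iterable(self, bqstorage_client=None):
        # Record which client was passed so the caller can inspect it.
        self.seen_client = bqstorage_client
        return iter(self._batches)


# Placeholder for a session-wide BigQueryReadClient created once and reused.
shared_read_client = object()
fake = FakeRowIterator(["batch-0", "batch-1"])
batches = list(iter_arrow_batches(fake, shared_read_client))
```

With a real session, the same read client would be passed on every call, so the per-query cost of constructing a client never arises.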

)
assert execute_result.total_bytes is not None
Collaborator

Might this assertion be false with an empty DataFrame (i.e. no rows)?

Contributor Author

We have a few tests that produce empty results. I stepped through with a debugger and confirmed that total_rows=0 and total_bytes=0.

)
assert execute_result.total_bytes is not None
table_size = execute_result.total_bytes / _BYTES_TO_MEGABYTES
Collaborator

Maybe rename this to table_mb to reflect the units?

P.S. I'm trying to find some link to this convention. https://google.aip.dev/142#compatibility is somewhat relevant, though date/time specific.

Might have been this?

Its often a good idea to include units in a name. Using variable names like "distanceToTargetInCentimeters" Vs "distanceToTargetInInches" can avoid confusion. Some people would suggest using an underscore instead of the word "In" in such names. It acts as punctuation, separating the purpose of the variable from its units: "distanceToTarget_centimeters". It's Hungarian at a more abstract level (and with meaningful words instead of cryptic letters). -- DaveWhipp

https://wiki.c2.com/?HungarianNotation

Contributor Author

changed to table_mb
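For illustration, a small sketch of the size computation with the unit carried in the name, including the empty-result case discussed earlier. The value of `_BYTES_TO_MEGABYTES` is an assumption (the excerpt uses the constant without showing its definition), and `result_size_mb` is a hypothetical helper, not a function from the PR.

```python
from typing import Optional

# Assumed value: the diff references _BYTES_TO_MEGABYTES but does not define it.
_BYTES_TO_MEGABYTES = 1024 * 1024


def result_size_mb(total_bytes: Optional[int]) -> float:
    """Convert a byte count to megabytes; the suffix records the unit."""
    if total_bytes is None:
        # Mirrors the `assert ... is not None` in the diff, but with a message.
        raise ValueError("total_bytes is unknown for this result")
    return total_bytes / _BYTES_TO_MEGABYTES


# An empty result reports total_bytes=0, which is not None, so it converts cleanly.
empty_table_mb = result_size_mb(0)
one_table_mb = result_size_mb(1024 * 1024)
```

Because `0` is a valid integer rather than `None`, an empty result passes the check and yields a size of zero megabytes.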

@TrevorBergeron TrevorBergeron changed the title from "perf: Use read api for to_pandas when >1000 results" to "refactor: Simplify query executor interface" Sep 24, 2024
@@ -41,10 +48,20 @@
QUERY_COMPLEXITY_LIMIT = 1e7
# Number of times to factor out subqueries before giving up.
MAX_SUBTREE_FACTORINGS = 5

# A bytes limit would probably be more appropriate,
_READ_API_MIN_ROWS = 1000
Collaborator

This constant is no longer used, right?

Contributor Author

removed

@TrevorBergeron TrevorBergeron merged commit c89e92e into main Sep 26, 2024
22 of 23 checks passed
@TrevorBergeron TrevorBergeron deleted the session_simplify branch September 26, 2024 21:26