I often find it useful to stash .csv files in a Google Cloud Storage (GCS) bucket so they can be accessed from a Vertex AI JupyterLab notebook.
Once the files are accessible from Vertex, I can do all sorts of things with them.
One of the most common operations I perform in this scenario is loading the .csv data into a Pandas DataFrame, which I can then use to enrich other data I have stored in BigQuery.
import pandas as pd
from google.cloud import storage
from io import BytesIO

def generate_df(bucket_name_input: str, files: list) -> pd.DataFrame:
    client = storage.Client()
    bucket = client.get_bucket(bucket_name_input)
    frames = []
    for file in files:
        # this construct assumes the files are similarly structured
        blob = bucket.get_blob(file)
        # download_as_bytes() replaces the deprecated download_as_string()
        content = blob.download_as_bytes()
        frames.append(pd.read_csv(BytesIO(content)))
    # DataFrame.append() was removed in pandas 2.0; pd.concat() is the
    # idiomatic way to combine the per-file frames
    return pd.concat(frames, ignore_index=True)
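As a usage sketch, here is one way the resulting DataFrame might feed the BigQuery enrichment step mentioned above. The bucket name, file names, dataset, table, and user_id join key are all hypothetical, and the enrichment itself is just a standard merge:

from google.cloud import bigquery

# Hypothetical bucket and file names for illustration
csv_df = generate_df("my-staging-bucket", ["users_2023.csv", "users_2024.csv"])

# Pull the BigQuery side into a DataFrame (hypothetical dataset/table)
bq_client = bigquery.Client()
bq_df = bq_client.query("SELECT user_id, region FROM my_dataset.users").to_dataframe()

# Enrich the BigQuery rows with the .csv data via a shared key
enriched = bq_df.merge(csv_df, on="user_id", how="left")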