All Questions

0 votes · 1 answer · 298 views

Subprocess Error When Trying to Pip Install Fastparquet on Windows 10 & Python 3.13

I am trying to pip install fastparquet and get the error below. I have searched but cannot find anything on this specific issue. I've tried running CMD as administrator, but that does not help. I've ...
Robsmith (473)
1 vote · 1 answer · 667 views

Python: OSError: [Errno 22] Invalid argument when trying to use pandas.read_parquet

I have this simple code: import pandas as pd; file = pd.read_parquet('file.rot', engine='fastparquet'). file.rot is a table of data (float numbers) with 5 columns. When I run it, the error that appears is ...
EsOj (13)
0 votes · 1 answer · 167 views

Loading columnar-structured time-series data faster into NumPy arrays

Hi! Are there any ways to load large, (ideally) compressed, columnar-structured data faster into NumPy arrays in Python? Considering common solutions such as Pandas, Apache Parquet/Feather and ...
1 vote · 0 answers · 585 views

Unable to read Parquet file with PyArrow: Malformed levels

Assume that I am unable to change how the Parquet file is written, i.e. it is immutable, and so we must find a way of reading it given the following complexities... In: import pandas as pd pd....
Tom Bomer (113)
0 votes · 1 answer · 51 views

How to Handle Growing _metadata File Size and Avoid Corruption in Amazon Redshift Spectrum Parquet Append

Context: Our web application generates a lot of log files that arrive in an S3 bucket. The files in the bucket contain JSON strings and have a .txt extension. We process these files in chunks of 200 ...
Aakash (39)
1 vote · 2 answers · 1k views

How can I ignore non-existent columns in the pandas read_parquet function?

I am trying to read parquet files through pandas, where a few columns do not exist in some files. I would like to skip the column-existence check in the read parquet function. def column_data(self)...
soft encoder
0 votes · 1 answer · 103 views

Asynchronous processing of data but sequential file save in multiprocessing

I'm processing a really large log file, e.g. 300 GB, and I have a script which reads the file in chunks and asynchronously processes the data (I need to read some key:value pairs from it) in a pool of processes and ...
sarkafa
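The usual trick for "process in parallel, save in order" is `Pool.imap`, which yields results in submission order even when workers finish out of order. A minimal sketch, shown here with the thread-backed `multiprocessing.dummy` pool so it runs anywhere; the process-backed `multiprocessing.Pool` has the same API (the chunks and the `process` function are stand-ins):

```python
from multiprocessing.dummy import Pool  # thread-backed; same API as multiprocessing.Pool

def process(chunk):
    # Stand-in for parsing key:value pairs out of one chunk of the log.
    return chunk.upper()

chunks = ["chunk-%d" % i for i in range(5)]

with Pool(4) as pool:
    # imap yields results in submission order even though workers may
    # finish out of order, so the output file can be written sequentially.
    results = list(pool.imap(process, chunks))
print(results)
```

In the real script, each yielded result would be appended to the output file as it arrives instead of collected into a list.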
0 votes · 1 answer · 1k views

Error converting column to bytes using encoding UTF8

I got the error below when writing a dask dataframe to S3 and couldn't figure out why. Does anybody know how to fix it? dd.from_pandas(pred, npartitions=npart).to_parquet(out_path) The error is ...
Justin Shan
1 vote · 2 answers · 2k views

Unable to write parquet with DATE as logical type for a column from pandas

I am trying to write a parquet file which contains one date column whose logical type in parquet is DATE and physical type is INT32. I am writing the parquet file using pandas and using fastparquet ...
Behroz Sikander
0 votes · 1 answer · 99 views

What is the best way to train a binary classifier with 1000 parquet files?

I'm training a binary classification model on a huge dataset in parquet format. However, it is so large that I cannot fit all of the data into memory. Currently I am doing it as below, but I'm facing out-...
Mason (27)
0 votes · 1 answer · 180 views

Error installing tsflex on Mac: "Failed building wheel for fastparquet"

I've come across an issue while attempting to install the tsflex package on my Mac using pip3. After running pip3 install tsflex, I received the following error message: Collecting tsflex Using ...
Sira (11)
0 votes · 1 answer · 595 views

Parquet timestamp overflow with fastparquet/pyarrow

I have a parquet file I am reading from S3 using fastparquet/pandas. The parquet file has a column with the date 2022-10-06 00:00:00, and I see it is wrapping to 1970-01-20 06:30:14.400. Please see ...
Bill (363)
1 vote · 1 answer · 2k views

pyarrow timestamp datatype error on parquet file

I have this error when I read and count records in pandas using pyarrow. I do not want pyarrow to convert to timestamp[ns]; it can keep timestamp[us]. Is there an option to keep the timestamp as is? ...
Bill (363)
1 vote · 2 answers · 3k views

How to efficiently read .pq files in Python

I have a list of files with the .pq extension, whose names are stored in a list. My intention is to read these files, filter them with pandas, and then merge them into a single pandas data frame. ...
sergey_208
0 votes · 1 answer · 1k views

How can I query parquet files with the Polars Python API?

I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column. For example, I might have a column name in that .parquet file and want to get back the ...
SamTheProgrammer