All Questions
Tagged with fastparquet python
88 questions
0
votes
1
answer
298
views
Subprocess Error When Trying to Pip Install Fastparquet on Windows 10 & Python 3.13
I am trying to pip install Fastparquet and get the error below. I have searched but cannot find anything on this specific issue. I've tried running CMD as administrator but that does not help. I've ...
1
vote
1
answer
667
views
Python: OSError: [Errno 22] Invalid argument when trying to use pandas.read_parquet
I have this simple code
import pandas as pd
file = pd.read_parquet('file.rot', engine='fastparquet')
file.rot is a table of data (float numbers) with 5 columns
When I run it the error that appears is ...
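A quick diagnostic worth trying here (a sketch, not from the thread): a valid parquet file begins and ends with the 4-byte magic `PAR1`, so an `OSError`/invalid-argument failure on a `.rot` file often just means the file is not parquet at all. The helper name and filename are illustrative.

```python
def looks_like_parquet(path):
    """Return True if the file carries parquet's PAR1 magic at both ends."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to 4 bytes before the end of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Hypothetical usage: check before handing the file to pandas/fastparquet.
#   if not looks_like_parquet('file.rot'):
#       raise ValueError("file.rot is not a parquet file")
```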
0
votes
1
answer
167
views
Loading columnar-structured time-series data faster into NumPy arrays
Hi! Are there any ways to load large, (ideally) compressed, and columnar-structured data faster into NumPy arrays in Python? Considering common solutions such as Pandas, Apache Parquet/Feather and ...
1
vote
0
answers
585
views
Unable to read Parquet file with PyArrow: Malformed levels
Assume that I am unable to change how the Parquet file is written, i.e. it is immutable and so we must find a way of reading it given the following complexities...
In:
import pandas as pd
pd....
0
votes
1
answer
51
views
How to Handle Growing _metadata File Size and Avoid Corruption in Amazon Redshift Spectrum Parquet Append
Context:
Our web application generates a lot of log files that arrive in an S3 bucket.
The files in the bucket contain JSON strings and have a .txt file format. We process these files in chunks of 200 ...
1
vote
2
answers
1k
views
How to ignore non-existent columns in the pandas read_parquet function
I am trying to read parquet files through pandas, where a few columns do not exist in some files.
I would like to skip the column existence check in the read_parquet function.
def column_data(self)...
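One hedged workaround for this situation: `read_parquet` raises when a requested column is missing, so intersect the wanted list with each file's actual schema first. The helper below is a sketch; the commented usage assumes fastparquet's `ParquetFile.columns` attribute, and all names are illustrative.

```python
def available_columns(requested, present):
    """Keep only the requested columns that actually exist, preserving order."""
    present_set = set(present)
    return [c for c in requested if c in present_set]

# Assumed usage (names hypothetical):
#   import fastparquet, pandas as pd
#   pf = fastparquet.ParquetFile(path)            # reads only the footer/schema
#   cols = available_columns(wanted, pf.columns)
#   df = pd.read_parquet(path, engine="fastparquet", columns=cols)
```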
0
votes
1
answer
103
views
asynchronous processing of data but sequential file save in multiprocessing
I'm processing a really large log file, e.g. 300 GB. I have a script which reads the file in chunks and asynchronously processes the data (I need to extract some key:values from it) in a pool of processes and ...
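The combination described here (parallel processing, sequential save) is exactly what ordered pool mapping gives you: `Executor.map` and `Pool.imap` yield results in submission order even when workers finish out of order. A minimal sketch, using a thread pool for brevity; for CPU-bound parsing, `ProcessPoolExecutor` is the drop-in swap. `parse_chunk` is a hypothetical stand-in for the real key:value extraction.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_chunk(chunk):
    # Stand-in for the real key:value extraction (hypothetical logic).
    return chunk.upper()

def process_in_order(chunks, workers=4):
    """Process chunks in parallel but yield results in input order,
    so the caller can append them to one output file sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # map() yields results in the order chunks were submitted,
        # regardless of which worker finishes first.
        yield from ex.map(parse_chunk, chunks)
```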
0
votes
1
answer
1k
views
Error converting column to bytes using encoding UTF8
I got the error below when writing a Dask dataframe to S3. I couldn't figure out why. Does anybody know how to fix it?
dd.from_pandas(pred, npartitions=npart).to_parquet(out_path)
The error is
error.. Error ...
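Since the error text is truncated, this is only a guess: fastparquet raises this kind of UTF-8 encoding error when an object column mixes strings with non-string values. A common workaround is to normalise the column to strings (keeping missing values missing) before writing. The frame and column name below are hypothetical; with Dask the same `astype`/`where` idea applies partition by partition.

```python
import pandas as pd

# Hypothetical mixed-type object column: fastparquet tries to UTF-8
# encode every value and can fail on the non-string entries.
pred = pd.DataFrame({"label": ["a", 3.5, None]})

# Cast non-missing values to str, but leave missing values as-is,
# before handing the frame to dask/fastparquet's to_parquet.
pred["label"] = pred["label"].where(pred["label"].isna(),
                                    pred["label"].astype(str))
```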
1
vote
2
answers
2k
views
Unable to write parquet with DATE as logical type for a column from pandas
I am trying to write a parquet file which contains one date column having logical type in parquet as DATE and physical type as INT32. I am writing the parquet file using pandas and using fastparquet ...
0
votes
1
answer
99
views
What is the best way to train binary classification with 1000 parquet files?
I'm training a binary classification model with a huge dataset in parquet format. However, it is so large that I cannot fit all of the data into memory. Currently, I am doing it like below, but I'm facing out-...
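A common pattern for this situation is to stream the files as fixed-size batches instead of loading everything up front. A minimal generator sketch, assuming a per-file reader; `load_file` is a hypothetical stand-in for whatever reads one parquet file's rows.

```python
def iter_batches(paths, load_file, batch_size):
    """Yield fixed-size lists of rows, reading one file at a time so only a
    single file's rows (plus the current batch) are ever held in memory."""
    batch = []
    for path in paths:
        for row in load_file(path):   # load_file: stand-in for a
            batch.append(row)         # per-file parquet reader
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch                   # flush the final partial batch
```

Each yielded batch can then be fed to one training step, so peak memory is bounded by one file plus one batch rather than the full dataset.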
0
votes
1
answer
180
views
Error installing tsflex on Mac: "Failed building wheel for fastparquet"
I've come across an issue while attempting to install the tsflex package on my Mac using pip3. After running pip3 install tsflex, I received the following error message:
Collecting tsflex
Using ...
0
votes
1
answer
595
views
Parquet timestamp overflow with fastparquet/pyarrow
I have a parquet file I am reading from S3 using fastparquet/pandas. The parquet file has a column with the date 2022-10-06 00:00:00, and I see it is wrapped as 1970-01-20 06:30:14.400. Please see ...
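A plausible explanation for the exact value reported (an assumption about the cause, but the arithmetic checks out): the timestamp's unit is being misread. 2022-10-06 00:00:00 UTC is 1,665,014,400 seconds after the epoch, and interpreting that same number as milliseconds lands precisely on 1970-01-20 06:30:14.400. The stdlib demonstrates the mismatch:

```python
from datetime import datetime, timezone

# 2022-10-06 00:00:00 UTC expressed as seconds since the epoch
seconds = int(datetime(2022, 10, 6, tzinfo=timezone.utc).timestamp())

# The same number *misread* as milliseconds collapses into January 1970 —
# exactly the 1970-01-20 06:30:14.400 the question reports.
wrapped = datetime.fromtimestamp(seconds / 1000, tz=timezone.utc)
print(wrapped)  # 1970-01-20 06:30:14.400000+00:00
```

So the fix is usually to make the reader honour the unit declared in the parquet metadata (or rewrite the column with an explicit unit) rather than to shift the values.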
1
vote
1
answer
2k
views
pyarrow timestamp datatype error on parquet file
I get this error when I read and count records in pandas using pyarrow. I do not want pyarrow to convert to timestamp[ns]; it can keep timestamp[us]. Is there an option to keep the timestamp as is? ...
1
vote
2
answers
3k
views
How to efficiently read .pq files - Python
I have a list of files with the .pq extension, whose names are stored in a list. My intention is to read these files, filter them with pandas, and then merge them into a single pandas data frame.
...
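A sketch of the usual efficient shape for this task: filter each file immediately after reading it (so peak memory stays low), collect the pieces in a list, and call `pd.concat` exactly once instead of appending inside the loop. The reader and filter in the commented usage are assumptions, not from the question.

```python
import pandas as pd

def read_and_filter(paths, read_one, keep):
    """Read each file, filter it right away, and concatenate once at the end.
    A single concat avoids the quadratic cost of repeated appends."""
    frames = [keep(read_one(p)) for p in paths]
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage with real .pq files:
#   df = read_and_filter(
#       pq_paths,
#       read_one=lambda p: pd.read_parquet(p, engine="fastparquet"),
#       keep=lambda d: d[d["value"] > 0],
#   )
```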
0
votes
1
answer
1k
views
How can I query parquet files with the Polars Python API?
I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column.
For example, I might have a column name in that .parquet file and want to get back the ...