2 votes
1 answer
106 views

I’m working with an Iceberg table in Impala named customer_fact, partitioned by the column created_at. The table contains duplicate rows based on customer_id, and I want to retain only the latest ...
Norah • 21
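
A minimal sketch for the dedup question above, assuming impyla is installed and that customer_fact is an Iceberg v2 table Impala can overwrite; it keeps the newest row per customer_id (by created_at), stages the result, and writes it back. The host, port, and staging table name are placeholders, and the column list must be expanded to the real schema.

```python
# A minimal sketch, assuming impyla is installed and Impala can read/write the
# table. Host, port and the staging table name are placeholders; expand the
# column list to the real schema before running anything like this.
from impala.dbapi import connect

DEDUP_STATEMENTS = [
    # Latest row per customer_id into a plain Parquet staging table.
    """
    CREATE TABLE customer_fact_dedup STORED AS PARQUET AS
    SELECT customer_id, created_at  /* , remaining columns */
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY created_at DESC) AS rn
      FROM customer_fact
    ) ranked
    WHERE rn = 1
    """,
    # Replace the Iceberg table's data with the deduplicated copy.
    "INSERT OVERWRITE customer_fact SELECT * FROM customer_fact_dedup",
    "DROP TABLE customer_fact_dedup",
]

conn = connect(host="impala-host", port=21050)  # placeholder coordinator
cur = conn.cursor()
for stmt in DEDUP_STATEMENTS:
    cur.execute(stmt)
cur.close()
conn.close()
```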
1 vote
0 answers
35 views

We are running a sparklyr job that runs queries on a Cloudera CDP Hive cluster. The job sometimes stalls before a dbWriteTable call, doing nothing and running indefinitely. The job doesn't always ...
lrovere • 11
0 votes
2 answers
73 views

I'm setting up a CML session with 64 GB of RAM and 4 CPUs, then I set up a PySpark session with these configurations: spark = SparkSession.builder \ .appName("OptimizedSparkSession") \ ...
Perkūns
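
A hedged sketch of one way to size a PySpark session inside a 64 GB / 4-vCPU CML session; the memory and core values below are illustrative assumptions, not settings taken from the question.

```python
# A hedged sketch; values are illustrative for a 64 GB / 4-vCPU session,
# leaving headroom for the session process itself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("OptimizedSparkSession")
    .config("spark.driver.memory", "48g")          # leave headroom below 64 GB
    .config("spark.driver.cores", "4")
    .config("spark.sql.shuffle.partitions", "64")  # small box, fewer partitions
    .getOrCreate()
)

# Verify which settings actually took effect.
print(spark.sparkContext.getConf().getAll())
```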
0 votes
3 answers
144 views

Could someone explain which is faster: loading a table with a SQL query that filters it in the query itself, or loading the full table and filtering it afterwards with PySpark functions? For example, this ...
Perkūns
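
A hedged sketch contrasting the two approaches from the question above; in many cases Catalyst pushes DataFrame filters into the scan, so both often produce the same physical plan, which is why .explain() is shown. The table and column names are placeholders.

```python
# A hedged sketch comparing SQL-side filtering with DataFrame-side filtering.
# Table/column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter_comparison").getOrCreate()

# 1) Filter expressed inside the SQL query.
df_sql = spark.sql("SELECT * FROM db.sales WHERE sale_date >= '2024-01-01'")

# 2) Load the table, then filter with DataFrame functions.
df_api = spark.table("db.sales").filter(F.col("sale_date") >= "2024-01-01")

# Compare the physical plans; identical pushed filters imply similar performance.
df_sql.explain()
df_api.explain()
```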
1 vote
1 answer
67 views

I am stuck on an error with the encoding of non-ASCII characters in FlowFile content in NiFi. I am processing the text with an ExecuteScript processor using Jython. The flow is a simple ...
alex • 13
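
A minimal Jython sketch for the ExecuteScript question above, assuming the standard NiFi scripting bindings (session, REL_SUCCESS); it reads and writes the FlowFile content explicitly as UTF-8, which is the usual fix when non-ASCII characters get mangled by the JVM default charset.

```python
# A minimal sketch for ExecuteScript with Jython, assuming the standard
# NiFi scripting bindings; decodes and re-encodes the content as UTF-8.
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class Utf8Callback(StreamCallback):
    def process(self, inputStream, outputStream):
        # Read the FlowFile content as UTF-8, not the platform default charset.
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        # ... transform `text` here ...
        outputStream.write(bytearray(text.encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, Utf8Callback())
    session.transfer(flowFile, REL_SUCCESS)
```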
0 votes
0 answers
45 views

I am working in Cloudera Data Science Workbench (CDSW) and have created a virtual environment named "testenv". I started a session and activated my virtual environment using: ...
Bini Yoni
0 votes
0 answers
37 views

NiFi's EncryptContent processor throws a "Can't use an RSA_SIGN key for encryption" error. I tried both .gpg and .asc key file formats.
Sam • 21
0 votes
2 answers
144 views

I want to programmatically retrieve the name of the script used by the current job that runs a Python script on the Cloudera ML platform. The __file__ magic variable doesn't work because in the background our ...
Mischa Lisovyi
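
A hedged sketch of standard-library fallbacks for when __file__ is not set (for example when a launcher exec()s the script); nothing here is specific to the Cloudera ML API.

```python
# A hedged sketch of fallbacks when __file__ is unset; standard library only.
import inspect
import os
import sys

def current_script_name():
    # 1) __file__ if the interpreter set it.
    candidate = globals().get("__file__")
    # 2) The script path the interpreter was invoked with.
    if not candidate and sys.argv and sys.argv[0]:
        candidate = sys.argv[0]
    # 3) The source file of the current frame, if it is a real file.
    if not candidate:
        candidate = inspect.getsourcefile(sys._getframe())
    return os.path.basename(candidate) if candidate else None

print(current_script_name())
```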
0 votes
1 answer
55 views

I have the following table in SQL:

ID  CreatedDate          OldValue       NewValue
1   18/11/2024 13:05:10  Open           Escalated
1   18/11/2024 14:05:10  Escalated      With Customer
1   18/11/2024 16:05:10  With Customer  Closed
2   20/...
MahdiJ • 1
0 votes
1 answer
45 views

I created a Cloudera cluster on AWS following these instructions https://docs.cloudera.com/cdp-public-cloud/cloud/getting-started/topics/cdp-deploy_cdp_using_terraform.html and these Terraform scripts https://...
VladS • 4,356
0 votes
1 answer
90 views

We have an enterprise Hadoop cluster installed on Linux servers in our organisation. I am trying to insert a CSV file into one of our Hive tables. The CSV file is on my local Windows machine. I am using ...
Pavan Sai Aravala
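
A minimal PySpark sketch, assuming the CSV has first been copied somewhere the cluster can read (for example with hdfs dfs -put from an edge node) and that Spark has Hive support enabled; the path and table names are placeholders.

```python
# A minimal sketch, assuming the CSV was first copied to HDFS and Spark has
# Hive support; path and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv_to_hive")
    .enableHiveSupport()
    .getOrCreate()
)

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/user/me/input/data.csv")           # placeholder HDFS path
)

# Append into an existing Hive table; column order must match the table.
df.write.mode("append").insertInto("mydb.my_hive_table")  # placeholder table
```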
0 votes
1 answer
53 views

I have a requirement to gather run durations for the last 3 months for a particular Airflow job. In our CDE environment we use Airflow to call Spark DBT jobs; of late the run duration of the job ...
Anil_468
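
A hedged sketch using Airflow's stable REST API to pull run durations for roughly the last 3 months; the base URL, DAG id, and credentials are placeholders, and the API must be enabled and reachable from wherever this runs.

```python
# A hedged sketch against Airflow's stable REST API; URL, DAG id and
# credentials are placeholders.
from datetime import datetime, timedelta, timezone

import requests

BASE_URL = "https://<airflow-host>/api/v1"   # placeholder
DAG_ID = "spark_dbt_job"                     # placeholder DAG id
SINCE = (datetime.now(timezone.utc) - timedelta(days=90)).isoformat()

resp = requests.get(
    f"{BASE_URL}/dags/{DAG_ID}/dagRuns",
    params={"start_date_gte": SINCE, "limit": 100, "order_by": "start_date"},
    auth=("user", "password"),               # placeholder credentials
)
resp.raise_for_status()

for run in resp.json()["dag_runs"]:
    if run.get("start_date") and run.get("end_date"):
        start = datetime.fromisoformat(run["start_date"].replace("Z", "+00:00"))
        end = datetime.fromisoformat(run["end_date"].replace("Z", "+00:00"))
        print(run["dag_run_id"], end - start)
```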
0 votes
0 answers
265 views

I've been trying to send data from Kafka to Snowflake using the JDBC driver with Kafka Connect. Some details about the environment: Kafka is running in a Cloudera private cluster (Base 7.1.9). The ...
alex • 13
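
A hedged sketch of registering a JDBC sink connector through the Kafka Connect REST API for the question above, assuming the Confluent JDBC sink connector and the Snowflake JDBC driver are on the Connect worker's plugin path; the URLs, credentials, connector name, and topic are placeholders.

```python
# A hedged sketch posting a JDBC sink config to the Kafka Connect REST API;
# all names, URLs and credentials below are placeholders.
import json

import requests

CONNECT_URL = "http://connect-worker:8083"   # placeholder Connect worker
config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "events",                       # placeholder topic
    "connection.url": "jdbc:snowflake://<account>.snowflakecomputing.com/?db=MYDB&schema=PUBLIC",
    "connection.user": "KAFKA_LOADER",        # placeholder
    "connection.password": "********",
    "insert.mode": "insert",
    "auto.create": "true",
    "tasks.max": "1",
}

resp = requests.put(
    f"{CONNECT_URL}/connectors/snowflake-jdbc-sink/config",
    headers={"Content-Type": "application/json"},
    data=json.dumps(config),
)
resp.raise_for_status()
print(resp.json())
```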
1 vote
2 answers
98 views

I tried to insert data into Cloudera/Hive using SSIS. The connection from SSIS to Cloudera uses ODBC. I got an issue when executing the task: the generated insert script includes double ...
angga_sbs
1 vote
2 answers
797 views

I'm trying to access an Impala DB via SQLAlchemy. I have configured a DSN that allows me to connect to the DB when using pyodbc directly. However, when using SQLAlchemy I get an error: When using a db ...
ErnstW • 33
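
A hedged sketch using impyla's SQLAlchemy dialect instead of the ODBC DSN, since SQLAlchemy needs a registered dialect rather than a bare pyodbc connection; the host, port, and database are placeholders.

```python
# A hedged sketch; uses impyla's `impala://` SQLAlchemy dialect rather than
# the pyodbc DSN. Host, port and database are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("impala://impala-host:21050/default")  # placeholder URL

with engine.connect() as conn:
    for row in conn.execute(text("SELECT 1")):
        print(row)
```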
