4,715 questions
Advice
1
vote
2
replies
134
views
How the Computer Handles Interrupts
What is the difference between an interrupt and a context switch?
I understand the concept of an interrupt and how it occurs. However, I'm digging deeper into the topic.
I studied Computer ...
3
votes
1
answer
146
views
How to catch EXCEPTION_PRIV_INSTRUCTION from RDPMC directly in Assembly (and without SEH)?
I'm experimenting with measuring CPU instruction latency and throughput on P and E cores using RDPMC on Windows 11, something like this:
MOV ECX, 0x40000000 ; Instructions Counter
RDPMC ; Read ...
0
votes
1
answer
64
views
Cache Allocation Technology in 13th Generation Core i9 13900E Intel CPU [closed]
I am trying to evaluate the impact of Cache Allocation Technology on my CPU. However, when I use either lscpu or cpuid -l 0x10 to check whether my CPU supports it, the output is false.
How is this possible?
How ...
1
vote
1
answer
88
views
Randomness instructions vs syscalls [closed]
I've been digging into the idea of "true" randomness, and I've noticed that modern CPUs support instructions for generating randomness. x64 has the RDRAND instruction, while ARM has RNDR (I'm not ...
1
vote
1
answer
108
views
Is CPU multithreading affected by divergence?
Building on this question here
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
0
votes
1
answer
285
views
How to handle "Could not initialize NNPACK! Reason: Unsupported hardware" warning in PyTorch / Silero VAD on cloud CPU?
I’m running Silero VAD (via PyTorch + torchaudio) on a Linode cloud instance (2 dedicated CPUs, 4 GB RAM). When I process 10-minute audio chunks, I always get repeated warnings like this and it doesn'...
7
votes
1
answer
226
views
Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?
I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled).
My test loop is ...
2
votes
0
answers
71
views
Need to do CPU profiling of a JRuby application
I need to do CPU profiling for a JRuby application (JRuby version 1.7.20.1-8), which uses Ruby version 1.9.3.
I tried using the default profiler but got the error below due to a version compatibility issue ...
0
votes
1
answer
52
views
Fargate CloudWatch CPU Utilisation differs from docker stats
Looking at the CPUUtilized CloudWatch metric for my Fargate service, it's showing max CPU units used as 1040 over the past 4 weeks, using a sampling period of 1 minute. I have 4 vCPUs provisioned to ...
0
votes
1
answer
170
views
Performance regression in a Kubernetes deployment that does not occur locally [closed]
I have a Docker image and an EC2 instance. When I run this image on my EC2 instance, it takes x seconds to finish. When I run the app natively, it also takes x seconds.
But if I deploy the exact image in a container in ...
2
votes
0
answers
209
views
Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?
I am measuring the latency of instructions.
For 64-bit primitives, integer division usually takes about 25 cycles on my 2.3 GHz DigitalOcean vCPU, while floating point division takes about 10 ...
0
votes
0
answers
71
views
Why must memory addresses be aligned?
Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memories ...
-3
votes
1
answer
110
views
Understanding when a hazard in MIPS occurs
I have a question regarding these two instructions:
lw r2, 10(r1)
lw r1, 10(r2)
Is there a hazard here? Do I need stalls between the two of them?
I want to know if any kind of hazard happens here. I ...
1
vote
0
answers
43
views
How to optimize CPU tensor slicing and asynchronous transfer to the GPU?
My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...
1
vote
0
answers
85
views
popcnt instruction not as fast as loop on Core Ultra 155H [duplicate]
I think the title says it all: I have implemented one popcnt function that counts bits in a loop with shifts and one that uses inline asm with the actual CPU instruction.
This is my C code:
#define ...