I need to insert many millions of rows and many GB of data into a database for a project that uses Spring Boot. I recreated a minimal example with a one-to-many relationship and am trying to find the fastest solution. The full code is here: https://github.com/Vuizur/springmassinsert, but the rough structure is:

// Email.java
@Entity
@Table(name = "emails")
public class Email {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "address")
    private String address;

    @Column(name="text")
    private String text;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "user_id")
    private User user;
}

// User.java
@Entity
@Table(name = "users")
public class User {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "name")
    private String name;

    @OneToMany(mappedBy = "user", cascade = CascadeType.ALL, fetch = FetchType.LAZY)
    private List<Email> emails;
}

But the most important part is the insert code, where I tested four versions:

@Service
public class InsertService {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Autowired
    private UserRepository userRepository;

    private final int NUMBER_OF_USERS = 10000;
    private final int NUMBER_OF_EMAILS = 10;

    public void insert() {
        List<User> users = new ArrayList<>();
        for (int i = 0; i < NUMBER_OF_USERS; i++) {
            User user = new User();
            user.setName("User " + i);
            List<Email> emails = new ArrayList<>();
            for (int j = 0; j < NUMBER_OF_EMAILS; j++) {
                Email email = new Email();
                email.setAddress("email" + j + "@gmail.com");
                email.setUser(user);
                emails.add(email);
            }
            user.setEmails(emails);
            users.add(user);
        }
        userRepository.saveAll(users);
    }

    public void insertBatch() {
        List<User> users = new ArrayList<>();
        for (int i = 0; i < NUMBER_OF_USERS; i++) {
            User user = new User();
            user.setName("User " + i);
            List<Email> emails = new ArrayList<>();
            for (int j = 0; j < NUMBER_OF_EMAILS; j++) {
                Email email = new Email();
                email.setAddress("email" + j + "@gmail.com");
                email.setUser(user);
                emails.add(email);
            }
            user.setEmails(emails);
            users.add(user);
            if (users.size() == 1000) {
                userRepository.saveAll(users);
                users.clear();
            }
        }
        // Save any leftover users that did not fill a complete batch of 1000
        if (!users.isEmpty()) {
            userRepository.saveAll(users);
        }
    }

    public void insertJdbc() {
        List<User> users = new ArrayList<>();
        for (int i = 0; i < NUMBER_OF_USERS; i++) {
            User user = new User();
            user.setName("User " + i);
            List<Email> emails = new ArrayList<>();
            for (int j = 0; j < NUMBER_OF_EMAILS; j++) {
                Email email = new Email();
                email.setAddress("email" + j + "@gmail.com");
                email.setUser(user);
                emails.add(email);
            }
            user.setEmails(emails);
            users.add(user);
        }
        for (User user : users) {
            // Insert the user and capture the generated id so the emails can reference it
            KeyHolder keyHolder = new GeneratedKeyHolder();
            jdbcTemplate.update(con -> {
                PreparedStatement ps = con.prepareStatement("insert into users (name) values (?)",
                        Statement.RETURN_GENERATED_KEYS);
                ps.setString(1, user.getName());
                return ps;
            }, keyHolder);
            user.setId(keyHolder.getKey().longValue());
            for (Email email : user.getEmails()) {
                jdbcTemplate.update("insert into emails (address, text, user_id) values (?, ?, ?)", email.getAddress(),
                        email.getText(), user.getId());
            }
        }
    }

    public void insertJdbcBatch() {
        List<User> users = new ArrayList<>();
        for (int i = 0; i < NUMBER_OF_USERS; i++) {
            User user = new User();
            user.setName("User " + i);
            List<Email> emails = new ArrayList<>();
            for (int j = 0; j < NUMBER_OF_EMAILS; j++) {
                Email email = new Email();
                email.setAddress("email" + j + "@gmail.com");
                email.setUser(user);
                emails.add(email);
            }
            user.setEmails(emails);
            users.add(user);
        }
        try (Connection conn = jdbcTemplate.getDataSource().getConnection();
                // Create a prepared statement for inserting users
                PreparedStatement userPs = conn.prepareStatement("insert into users (name) values (?)",
                        Statement.RETURN_GENERATED_KEYS);
                // Create a prepared statement for inserting emails
                PreparedStatement emailPs = conn
                        .prepareStatement("insert into emails (address, text, user_id) values (?, ?, ?)")) {
            for (User user : users) {
                // Set the user name parameter and add to the batch
                userPs.setString(1, user.getName());
                userPs.addBatch();
            }
            // Execute the batch update for users and get the generated ids
            userPs.executeBatch();
            try (ResultSet rs = userPs.getGeneratedKeys()) {
                int index = 0;
                while (rs.next()) {
                    // Set the user id from the result set
                    users.get(index).setId(rs.getLong(1));
                    index++;
                }
            }
            for (User user : users) {
                for (Email email : user.getEmails()) {
                    // Set the email parameters and add to the batch
                    emailPs.setString(1, email.getAddress());
                    emailPs.setString(2, email.getText());
                    emailPs.setLong(3, user.getId());
                    emailPs.addBatch();
                }
            }
            // Execute the batch update for emails
            emailPs.executeBatch();
        } catch (SQLException e) {
            e.printStackTrace();
        }

    }

    @PostConstruct
    public void benchmark() {
        long startTime = System.currentTimeMillis();
        insertJdbcBatch();
        long endTime = System.currentTimeMillis();
        System.out.println("Inserting " + NUMBER_OF_USERS + " users with " + NUMBER_OF_EMAILS + " emails each took "
                + (endTime - startTime) + " milliseconds");
        // Print inserts per second
        System.out.println((NUMBER_OF_USERS * NUMBER_OF_EMAILS * 1.0 / ((endTime - startTime) / 1000.0))
                + " inserts per second");
    }

}

The performance results are (on a bad external HDD):

Approach                                              Inserts/second
JPA, single saveAll (insert)                                    4247
JPA, saveAll every 1000 users (insertBatch)                     4401
JDBC, row by row (insertJdbc)                                   1842
JDBC, batched prepared statements (insertJdbcBatch)            13005
So the difference between the two JPA versions is not significant, the naive row-by-row JDBC version is really slow, and JDBC with batched prepared statements is the clear winner. Is there anything else I can do to speed up the inserts even more?

1 Answer

It makes sense to me that JDBC with prepared statements would win that contest, as we don't have to keep re-parsing the same old SQL command.

Batch size before COMMIT tends to be pretty important. Increase it from "1 row at a time", which gives terrible throughput, to ten rows, a hundred, a thousand, maybe ten thousand. At some point you'll see that rows-per-second throughput stops increasing, remains stable as you bump up the batch size, and then actually goes down as your giant batches become counterproductive. When sending across a TCP connection to a DB server, we want to be sending enough data that TCP gets to open the congestion window and send at full tilt. When storing to Winchester media, we want to send enough to keep the track buffer full, so it will write for a full spin, seek to the adjacent track, and keep writing. With TCP the big batches hide end-to-end packet latency, and with HDD they hide track seek latency. (The effect is less pronounced with SSD, though still quite noticeable.)
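
To make that knob concrete, here is a minimal sketch in plain JDBC against the same users table; the method name, the batchSize parameter, and the raw Connection are my own scaffolding, not anything taken from the question's code:

// Sketch only: illustrative batching helper, assumes a plain java.sql.Connection
// to the question's schema; insertUsers and batchSize are invented names.
public static void insertUsers(Connection conn, List<String> names, int batchSize) throws SQLException {
    conn.setAutoCommit(false); // one COMMIT per batch instead of one per row
    try (PreparedStatement ps = conn.prepareStatement("insert into users (name) values (?)")) {
        int pending = 0;
        for (String name : names) {
            ps.setString(1, name);
            ps.addBatch();
            if (++pending == batchSize) {
                ps.executeBatch(); // one round trip / one contiguous write for the whole batch
                conn.commit();     // one journal flush per batch
                pending = 0;
            }
        }
        if (pending > 0) {         // flush the final partial batch
            ps.executeBatch();
            conn.commit();
        }
    }
}

Sweeping batchSize over 10, 100, 1000, 10000 and recording rows/second for each run is usually enough to find the plateau described above.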

Rule of thumb: we want to send batches that will keep the underlying storage subsystem busy with about 500 msec to 5 seconds worth of work. Sending a batch of just 10 msec of work tends to expose latency issues, and sending a batch that will take many seconds to flush out to stable storage is counterproductive because it uselessly fills RAM, messing with our buffer eviction strategy.
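
If you want to automate that rule of thumb, a hypothetical sketch (every name in it is invented, and it writes throwaway probe rows as a side effect) is to time each flush and grow or shrink the batch until a flush lands inside the 500 msec to 5 second window:

// Sketch only: times each executeBatch + commit and adjusts the batch size;
// tuneBatchSize and the probe row values are made up for illustration.
public static int tuneBatchSize(Connection conn, int initialSize) throws SQLException {
    conn.setAutoCommit(false);
    int batchSize = initialSize;
    try (PreparedStatement ps = conn.prepareStatement("insert into users (name) values (?)")) {
        for (int round = 0; round < 10; round++) {
            long start = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
                ps.setString(1, "probe-user-" + round + "-" + i);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMillis < 500) {
                batchSize *= 2;   // flush finished too quickly: per-batch latency dominates
            } else if (elapsedMillis > 5000) {
                batchSize /= 2;   // flush takes too long: giant batches just fill RAM
            } else {
                break;            // inside the 500 msec to 5 second window
            }
        }
    }
    return batchSize;
}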

The other thing you can look at is making your "upload the rows" operation as simple as possible. In particular, verifying constraints like UNIQUE, FOREIGN KEY, and even just maintaining several indexes on different columns, can induce random read I/Os. What we want is sequential I/O, with hardly any seeking. So one strategy, which at a minimum lets you break apart aggregate timings into per-system components, is to INSERT the rows into a scratch_user table, something like that. Be sure to COMMIT every couple of seconds. As a separate operation, do a giant INSERT ... SELECT FROM scratch_user, as a single transaction with an explicit COMMIT at the end. This makes network effects disappear from the second transaction, and gives the query planner more flexibility to choose a good plan that involves sorting, enforcing uniqueness, and maybe table scanning for FK checks. Tacking on an ORDER BY will sometimes result in fuller table blocks, that is, with no randomly ordered INSERTs there is no need for B-tree page splits. But now we're getting into more subtle aspects.
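
Sketched against the question's own JdbcTemplate, the two-phase load might look roughly like this; scratch_user, its single column, the 10,000-row chunk size, and the ORDER BY column are all assumptions on my part:

// Sketch only: two-phase load through an unindexed staging table, meant to sit in
// the same InsertService as above and reuse its jdbcTemplate field.
public void insertViaScratchTable(List<User> users) {
    // Phase 1: a bare staging table with no indexes or constraints, so loading it
    // is close to pure sequential I/O
    jdbcTemplate.execute("create table if not exists scratch_user (name varchar(255))");
    List<Object[]> rows = new ArrayList<>();
    for (User user : users) {
        rows.add(new Object[] { user.getName() });
        if (rows.size() == 10_000) { // send each chunk of 10,000 rows as one JDBC batch
            jdbcTemplate.batchUpdate("insert into scratch_user (name) values (?)", rows);
            rows.clear();
        }
    }
    if (!rows.isEmpty()) {
        jdbcTemplate.batchUpdate("insert into scratch_user (name) values (?)", rows);
    }

    // Phase 2: one server-side statement, one transaction; the planner can sort and
    // bulk-check constraints, and the ORDER BY tends to produce fuller table blocks
    jdbcTemplate.execute("insert into users (name) select name from scratch_user order by name");
    jdbcTemplate.execute("drop table scratch_user");
}

Timing phase 1 and phase 2 separately then tells you whether round trips or index and constraint maintenance is eating your throughput.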

set expectations

You reported write throughput of 13,000 rows/second, using a single DB server and a single disk drive.

What read throughput could you obtain? (Stop the database, unmount its filesystem, remount, restart, so you're not cheating w.r.t. cache hits when you make timing measurements.) You probably can't pull much more than 90,000 rows/second out of single server + single drive, right?
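
One purely illustrative way to get that number (the method name is made up, and it assumes the question's jdbcTemplate plus an import of org.springframework.jdbc.core.ResultSetExtractor) is to time a full scan of the freshly loaded emails table right after the cold restart:

// Sketch only: estimates the read ceiling by walking every row of the emails table.
public void measureReadThroughput() {
    long start = System.currentTimeMillis();
    // Walk every row instead of asking for count(*), so the rows actually travel to
    // the client the way a real read workload would make them
    Long rowCount = jdbcTemplate.query(
            "select id, address, text, user_id from emails",
            (ResultSetExtractor<Long>) rs -> {
                long n = 0;
                while (rs.next()) {
                    n++;
                }
                return n;
            });
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(rowCount + " rows read in " + elapsed + " ms = "
            + (rowCount * 1000.0 / elapsed) + " rows/second");
}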

You can always scale out. Attach multiple drives, preferably as multiple tablespaces so the backend query planner can "see" what I/O resources are available. Double the number of cores on your DB server, if measurements show that you're CPU bound (often you won't be -- commonly it's I/O bound). Add more servers to your database cluster (though this tends to win more for concurrent reading than writing). Rent resources from a cloud vendor, so scale-out is limited only by your wallet.

Your write rate will never exceed your read rate, especially if we scribble each row into a transaction journal and into the destination table. So use "fastest SELECT time" to set reasonable expectations on what your equipment could possibly do, and don't deceive yourself by letting cache hits look like they're measuring disk read rates.

