Spring Batch Example – Building a bulk contact importer


In today’s data-driven world, efficiently handling vast volumes of data is paramount. This often involves tasks like ETL processes, data migrations, or other batch operations. Faced with these challenges, the initial impulse might be to build a custom solution. However, specialized frameworks, like Spring Batch, are tailor-made for these use cases. Through this practical Spring Batch example, not only will we demonstrate setting up a bulk contact importer, but we will also guide you on how to monitor its progress in real time.

This tutorial provides you with an example CSV containing 100,000 fictitious contact records. By the end of this tutorial, you’ll have the knowledge to build a seamless integration into your own application, providing valuable feedback to your users during data-intensive operations.

What is Spring Batch?

Spring Batch is a comprehensive framework within the larger Spring ecosystem, designed specifically for batch processing – the kind of processing where you deal with vast amounts of data, transforming and transporting it from one system to another. While the core concept sounds simple, the intricacies and pitfalls of batch processing are numerous. Here’s where Spring Batch comes in, offering a robust and scalable solution, ensuring your batch processing tasks are efficient, reliable, and maintainable.

Why Choose Spring Batch Over a Custom Solution?

Before diving into our tutorial, let’s address a fundamental question — why use Spring Batch in the first place? One might initially think, “I know how to parse a CSV file. Why do I need a framework like Spring Batch?” It’s a fair question. Parsing a CSV is a basic task, but the devil is in the details, especially when scaling to large datasets and ensuring robustness in production scenarios. Let’s delve into some reasons why Spring Batch stands out:

  1. Scalability: Processing a few hundred records is simple. But what happens when you have a few hundred thousand, or millions? Spring Batch delivers high performance, efficiently processing vast record volumes, and ensures smooth operations regardless of dataset size.
  2. Fault Tolerance: In the real world, errors happen. Records can corrupt, systems can temporarily shut down, and unexpected issues can arise. Spring Batch offers automatic retries, skip, and rollback features. It handles failures gracefully, skips problematic records, and logs them for later review without stopping the entire process.
  3. Transactional Integrity: A crucial aspect often overlooked in custom solutions is transaction management. Spring Batch divides jobs into consistent chunks of records, committing each as a single transaction. This guarantees the integrity of your data even if certain chunks encounter issues.
  4. Parallel Processing: In an era of multi-core processors, parallelization isn’t a luxury; it’s a necessity for performance. Spring Batch provides built-in mechanisms for parallel processing, ensuring you harness the full power of modern hardware.
  5. Rich Monitoring and Logging: Knowledge is power. With Spring Batch’s detailed logging, metrics collection, and job repository, you’re always in the know. It’s not just about knowing if a job succeeded or failed but understanding its progress, bottlenecks, and performance metrics. In this tutorial, we create a REST endpoint that allows you to monitor the ongoing status of the import ‘Job’.
  6. Reuse and Reduce: Why reinvent the wheel? With Spring Batch, many components are reusable. It reduces boilerplate, speeds up development, and ensures you’re using battle-tested components.
  7. Extensibility: Spring Batch addresses a wide range of use cases right out of the box, but it also offers design flexibility for extensions. If you have a unique requirement, you can extend Spring Batch to fit your needs.
  8. Integration Capabilities: In today’s interconnected systems, isolation isn’t an option. Spring Batch easily integrates with other systems, be it messaging queues, databases, or external services, ensuring your batch processing is in harmony with your ecosystem.
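To make the fault-tolerance point concrete: the step we build later in this tutorial does not enable it, but a chunk-oriented step can opt into skip and retry behaviour with a few extra builder calls. The sketch below is illustrative only — the exception types and limits are example choices, not part of this tutorial’s repository:

```java
// Illustrative only: a fault-tolerant variant of a chunk-oriented step.
// The exception classes and the skip/retry limits are example choices.
@Bean
public Step faultTolerantStep(JobRepository jobRepository,
                              PlatformTransactionManager transactionManager) {
    return new StepBuilder("faultTolerantStep", jobRepository)
        .<Contact, Contact>chunk(1000, transactionManager)
        .reader(reader(null))
        .writer(writer())
        .faultTolerant()
        .skip(FlatFileParseException.class)        // skip unparseable CSV lines...
        .skipLimit(100)                            // ...but fail the job after 100 of them
        .retry(TransientDataAccessException.class) // retry transient DB errors
        .retryLimit(3)
        .build();
}
```

Skipped records are recorded against the step execution (the `skipCount` we surface later via a REST endpoint), so failures remain visible without halting the whole import.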

By choosing Spring Batch, you’re not just opting for a tool to parse and process data. You’re investing in a comprehensive solution designed for real-world challenges, ensuring robustness, scalability, and maintainability. In this tutorial, we’ll not only set up an importer but also demonstrate how to monitor it in real time, empowering you to build user interfaces that keep users informed.

Tutorial: Building a Bulk Contact Importer with Spring Batch

Having understood the advantages of Spring Batch, let’s delve into a hands-on tutorial. In this guide, we are going to use Spring Boot along with Spring Batch to build a bulk contact importer, a practical example demonstrating how to use Spring Batch for importing a large CSV file of 100,000 contacts into a database. We will use Spring Boot to provide an endpoint to start the import process and a second endpoint to monitor the status of the import. This is to simulate the common scenario where a front-end application would allow a user to first choose a file, and submit a form to begin an import process.

You can find the full source code for this tutorial in our repository: https://github.com/tucanoo/spring-batch-example-contacts-importer

Prerequisites:

  • Java 17
  • Spring Boot 3
  • Gradle
  • Familiarity with Spring Boot and basic database operations.

Project setup

As with the majority of our tutorials, we will be using IntelliJ IDEA as our IDE, although you can use whatever you are familiar with. As a result, your interface may differ.

We will start by using Spring Initializr to create our project. We will be using Java 17, and Gradle as our build tool.

[Screenshot: Spring Initializr project setup]

Then for our dependencies, select:

  • Spring Boot Devtools
  • Lombok
  • Spring Web
  • Spring Batch
  • Spring Data JPA
  • H2 Database

Click Create as per our image below.

[Screenshot: Spring Initializr dependency selection]

Sample Contact Data

As we will require sample data for the import, you can use the sample file from the repository for this tutorial containing 100,000 contacts. Download the file from the following link and save it to your src/main/resources folder.
https://github.com/tucanoo/spring-batch-example-contacts-importer/blob/main/src/main/resources/100k_sample_contacts.csv

If saved correctly, your project folder structure should look as the image below:

[Screenshot: project folder structure]

For the purposes of this tutorial it is important NOT to edit this file. If you want to provide your own CSV data, ensure the header row contains the same column headers:

firstName, lastName, gender, email, phone, address, occupation, website

Data model

Our tutorial centers on importing sample Contact data. Thus, we need an appropriate data entity and a corresponding repository.

Create a new class named Contact under src/main/java/data/entities and add the necessary fields to the class. Note we’re using Lombok to save on the otherwise significant boilerplate code:

@Entity
@Getter
@Setter
public class Contact {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String firstName;
    private String lastName;
    private String gender;
    private String email;
    private String phone;
    private String address;
    private String occupation;
    private String website;
}

Now create a JPA data repository to allow us to persist our entities. Create the interface src/main/java/data/repositories/ContactRepository and add the following code:

public interface ContactRepository extends JpaRepository<Contact, Long> {
}

There’s only one thing left for the data persistence side and that is to configure a connection to the in-memory H2 database that we will use.

Add the following configuration to src/main/resources/application.properties

# In memory DB datasource
spring.datasource.url=jdbc:h2:mem:testdb
spring.datasource.driver-class-name=org.h2.Driver
spring.datasource.username=sa
spring.jpa.hibernate.ddl-auto=update
spring.h2.console.enabled=true

Spring Batch Configuration

As with many features provided to us by the Spring framework, the way we work with the feature is largely through configuration. So for this tutorial, we will create a configuration class specifically for the job of importing contacts. In real-life practice, our projects may have many configuration files reflecting the different types of data we are importing.

Start with creating a new class named src/main/java/data/importconfig/contacts/BatchImportConfigForContacts and add the code below:

@Configuration
@RequiredArgsConstructor
public class BatchImportConfigForContacts {

    private final EntityManagerFactory entityManagerFactory;

    /**
     * Creates and returns a {@link FlatFileItemReader} bean for reading CSV records.
     * The reader uses job parameters to determine the file path at runtime.
     *
     * @param path the file path provided at runtime through job parameters.
     * @return a configured FlatFileItemReader for reading Contact entities.
     */
    @Bean
    @StepScope
    public FlatFileItemReader<Contact> reader(@Value("#{jobParameters['filePath']}") String path) {
        return new FlatFileItemReaderBuilder<Contact>()
            .name("personItemReader")
            .resource(new FileSystemResource(path))
            .linesToSkip(1)  // skip header row
            .delimited()
            .names(new String[]{"firstName", "lastName", "gender", "email", "phone", "address", "occupation", "website"})
            .fieldSetMapper(new BeanWrapperFieldSetMapper<Contact>() {{
                setTargetType(Contact.class);
            }})
            .build();
    }

    /**
     * Defines the main batch job for importing contacts.
     *
     * @param jobRepository the repository for storing job metadata.
     * @param step1 the step associated with this job.
     * @return a configured Job for importing contacts.
     */
    @Bean
    public Job importContactsJob(JobRepository jobRepository, Step step1)  {
        return new JobBuilder("importContactsJob", jobRepository)
            .start(step1)
            .build();
    }

    /**
     * Creates and returns a {@link JpaItemWriter} bean for persisting Contact entities.
     *
     * @return a configured JpaItemWriter for writing Contact entities.
     */
    @Bean
    public JpaItemWriter<Contact> writer() {
        JpaItemWriter<Contact> writer = new JpaItemWriter<>();
        writer.setEntityManagerFactory(entityManagerFactory);
        return writer;
    }

    /**
     * Defines the main batch step which includes reading, processing (if any), and writing.
     *
     * @param jobRepository the repository for storing job metadata.
     * @param transactionManager the transaction manager to handle transactional behavior.
     * @return a configured Step for reading and writing Contact entities.
     */
    @Bean
    public Step step1(JobRepository jobRepository, PlatformTransactionManager transactionManager) {
        return new StepBuilder("step1", jobRepository)
            .<Contact, Contact>chunk(1000, transactionManager)
            .reader(reader(null))  // null path just for type resolution
            .writer(writer())
            .build();
    }
}

Let’s take a look at this class more closely since this is the engine behind our entire import functionality.

Spring Batch Configuration Class Breakdown

reader(@Value("#{jobParameters['filePath']}") String path)

This method configures a FlatFileItemReader bean to read records from a CSV file.

The @Value("#{jobParameters['filePath']}") annotation uses Spring Expression Language (SpEL) to dynamically inject the file path provided at runtime as a job parameter, allowing different runs of the job to process different files without altering the batch configuration.

  • @StepScope: It enables the late binding of parameters, such as the filePath, which is passed dynamically when the job runs. We delay the Bean’s creation until the step starts because we want to inject dynamic values, like the job parameters, during runtime.
  • new FileSystemResource(path): Specifies the file’s location to be read.
  • .linesToSkip(1): Ensures the header row of the CSV is skipped to prevent it from being treated as data.
  • .delimited(): We indicate the file uses a delimiter to separate its contents, and by default, it’s a comma for CSV files.
  • .names(...): Lists the column names as they appear in the CSV. This is essential for mapping data accurately to our Contact entity.
  • BeanWrapperFieldSetMapper: It maps the fields from the CSV to our Contact entity.

importContactsJob(JobRepository jobRepository, Step step1)

This method defines the main batch job.

  • The JobBuilder is utilized to configure a Job. A Job in Spring Batch is the entire batch process you intend to run.
  • .start(step1): The batch job commences with the step named “step1”.

writer()

This method sets up the writer configuration. A Spring Batch writer is responsible for persisting processed data, often to databases, files, or other external systems, ensuring efficient bulk operations and transaction management. We are injecting our entityManagerFactory into this class so we can provide our JPA connection directly to the writer.

  • JpaItemWriter: A specialized writer from Spring Batch that uses JPA to persist data into a database. In this case, it’s used to save our Contact entities.
  • .setEntityManagerFactory(entityManagerFactory): Provides the writer with an EntityManagerFactory to facilitate database operations using JPA.

step1(JobRepository jobRepository, PlatformTransactionManager transactionManager)

This method creates and configures the primary step of our batch process.

Often, you might reference a ‘processor’ at this stage if you need to perform special data processing, like transformations. For instance, you might want to ensure the ‘website’ data has a correctly formatted HTTP or HTTPS prefix, or you might want to add a ‘fullName’ field that combines the last and first names.
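As a sketch of that idea, the normalization logic itself can be plain Java; in the real step it would be invoked from an ItemProcessor<Contact, Contact> registered with .processor(...) between .reader(...) and .writer(...). The class and method names below are our own illustration, not part of the tutorial’s repository:

```java
// Hypothetical helper for the 'website' normalization described above.
// In the batch step it would be called from an ItemProcessor<Contact, Contact>
// registered with .processor(...) between .reader(...) and .writer(...).
public class WebsiteNormalizer {

    /** Prepend https:// unless the value already carries an http(s) scheme. */
    public static String withHttpsPrefix(String url) {
        if (url == null || url.isBlank()) {
            return url;
        }
        return url.matches("(?i)^https?://.*") ? url : "https://" + url;
    }
}
```

Keeping such logic in a pure helper makes it trivial to unit test independently of the batch infrastructure. Note that returning null from an ItemProcessor filters the record out entirely, which is the idiomatic way to drop bad rows during processing.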

  • StepBuilder: Helps in creating a “Step”, which can be seen as a phase within a Job. During each step, data is read, possibly processed, and then written.
  • .chunk(1000, transactionManager): Configures the step to work with chunks of data. In this setup, the system processes 1000 records at once. The transaction manager guarantees safe transactional processing for these chunks. The performance of Spring Batch often hinges on the transaction size, so you should experiment with and adjust it based on your specific needs.
  • .reader(reader(null)): Incorporates the reader we defined above. The null serves purely for type resolution and is replaced by the actual path when the job runs.
  • .writer(writer()): Integrates the writer we set up above.
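Since a chunk-oriented step performs one commit per chunk (including the final, possibly partial, one), our 100,000-record file with a chunk size of 1,000 should produce roughly 100 commits — the commitCount metric our status endpoint reports later. A small helper (our own illustration, not from the tutorial’s repository) captures that arithmetic:

```java
// Illustrative arithmetic: a chunk-oriented step commits once per chunk,
// including the final, possibly partial, chunk (hence the ceiling division).
public class ChunkMath {

    public static long commitCount(long totalRecords, long chunkSize) {
        return (totalRecords + chunkSize - 1) / chunkSize;
    }
}
```

A larger chunk size means fewer, heavier transactions; a smaller one means more commit overhead but finer-grained recovery, which is why it is worth tuning for your dataset.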

By understanding the methods in BatchImportConfigForContacts, one can grasp the lifecycle of the batch process — from reading data in chunks to processing and finally writing them into a database.

By default, Spring Batch jobs will launch at application startup, which is not appropriate for this example use-case as we want to launch the job when requested via a URL call. So we need to add an additional configuration item to our application.properties to disable this behaviour. Add the following content to src/main/resources/application.properties:

# Disable batch from starting jobs at startup
spring.batch.job.enabled=false

Starting and monitoring the Spring Batch job

In reality, your application will include its own user interface, whether web or desktop, or it may even monitor a file location for incoming data, launching a new Spring Batch job when it detects new data to load. In this example, however, we are going to provide a simple RestController with two endpoints:

A means to initialise the import job, and a means to monitor the running job.

This simulates the common workflow where a user navigates an application, selects, and uploads a file from their local workstation, launching the import job upon upload. We will provide a /status endpoint allowing us to view in real-time the number of records processed, the number of failures, the number of commits it has made to the DB, and an up-to-date count of records in our database.

As we also know our dataset contains 100,000 records, we can also calculate the progress. In real life, you may also include such an endpoint so your front-end application can poll this URL and display a nice progress bar and information back to the user.

Create a new class src/main/java/controllers/ContactImportController and include the following code:

@RestController
@RequestMapping("/importExample")
@RequiredArgsConstructor
public class ContactImportController {
    private final JobLauncher jobLauncher;
    private final Job importContactsJob;
    private final JobExplorer jobExplorer;
    private final ContactRepository contactRepository;

    /**
     * Endpoint to start the contacts import batch job.
     * Simulates a user uploading a CSV file of contacts.
     *
     * @return Response indicating if the batch job was invoked successfully.
     * @throws Exception if any error occurs during job launch.
     */
    @GetMapping("/start")
    public ResponseEntity<String> handle() throws Exception {

        // simulate the user uploading a CSV file of contacts to this controller endpoint
        ClassPathResource sampleContactsData = new ClassPathResource("100k_sample_contacts.csv");
        String pathToResource = sampleContactsData.getFile().getAbsolutePath();

        JobParameters params = new JobParametersBuilder()
            .addString("filePath", pathToResource)
            .addString("JobID", String.valueOf(System.currentTimeMillis()))
            .toJobParameters();
        jobLauncher.run(importContactsJob, params);

        return ResponseEntity.ok().body("Batch job has been invoked");
    }

    /**
     * Endpoint to fetch the current status of the contacts import batch job.
     * Provides insights like job status, number of records read/written, progress percentage, etc.
     * Also verifies the number of records in the Contacts table by calling our repositories count() function
     *
     * @return Response with status and metrics related to the batch job.
     */
    @GetMapping("/status")
    public ResponseEntity<Map<String, Object>> getJobStatus() {
        Map<String, Object> response = new HashMap<>();

        List<JobInstance> instances = jobExplorer.getJobInstances("importContactsJob", 0, 1);
        if (instances.isEmpty()) {
            response.put("message", "No job instance found");
            return ResponseEntity.status(HttpStatus.NOT_FOUND).body(response);
        }

        List<JobExecution> jobExecutions = jobExplorer.getJobExecutions(instances.get(0));

        if (jobExecutions.isEmpty()) {
            response.put("message", "No job execution found");
            return ResponseEntity.status(HttpStatus.NOT_FOUND).body(response);
        }

        JobExecution lastJobExecution = jobExecutions.get(0);
        for (JobExecution jobExecution : jobExecutions) {
            if (jobExecution.getCreateTime().isAfter(lastJobExecution.getCreateTime())) {
                lastJobExecution = jobExecution;
            }
        }

        BatchStatus batchStatus = lastJobExecution.getStatus();
        response.put("status", batchStatus.toString());

        Collection<StepExecution> stepExecutions = lastJobExecution.getStepExecutions();
        for (StepExecution stepExecution : stepExecutions) {
            // In our case, there's only one step. If you have multiple steps, you might want to key by step name.
            response.put("readCount", stepExecution.getReadCount());
            response.put("writeCount", stepExecution.getWriteCount());
            response.put("commitCount", stepExecution.getCommitCount());
            response.put("skipCount", stepExecution.getSkipCount());
            response.put("rollbackCount", stepExecution.getRollbackCount());
            response.put("contactsInDB", contactRepository.count());

            // Progress indicator. Assuming you know the total records in advance (100,000 in this case).
            int progress = (int) (((double) stepExecution.getReadCount() / 100000) * 100);
            response.put("progress", progress + "%");
        }

        return ResponseEntity.ok().body(response);
    }

}

Here you can immediately see several injected fields that we use throughout the controller:

  • JobLauncher:
    • It’s used to start the importContactsJob with specific JobParameters.
  • Job:
    • Represents the main job configured to import contacts. It’s invoked using the jobLauncher.
  • JobExplorer:
    • Used to fetch the status and other details of previously executed or currently running instances of the importContactsJob. This allows the controller to provide updates and insights into the progress and status of the job.
  • ContactRepository:
    • Used to perform operations related to the Contact entity, like counting the number of contacts in the database, which gives an idea of the progress of the ongoing import job.

Let’s cover the two endpoints just one last time before we attempt to run the application.

Endpoint “/start”

This endpoint initialises the contacts import batch job. When invoked, it simulates the scenario of a user uploading a CSV file filled with contact data. Behind the scenes, the CSV data file named “100k_sample_contacts.csv” is located, and its absolute path is fetched. Subsequently, a unique job ID is generated using the current system time in milliseconds.

These details, i.e., the file path and the unique job ID, are set as parameters for the importContactsJob. The job is then launched using the jobLauncher.

Once the batch job gets successfully invoked, the endpoint responds with a message: “Batch job has been invoked”.

Endpoint “/status”

This endpoint provides the current status and various metrics associated with the contacts import batch job. From the latest execution, it extracts key metrics like batch status, read count, write count, commit count, skip count, rollback count, and the total number of contacts currently in the database. Furthermore, it calculates a progress indicator as a percentage based on the number of records read and the total number expected.

The progress, along with other metrics, is returned as a response to the user, allowing them to understand how far the job has proceeded and whether any issues (like skipped or rolled back records) have occurred.
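The progress arithmetic from the controller can be isolated into a small pure function, which also makes it easy to unit test; the class name below is our own:

```java
// Mirrors the /status endpoint's progress formula:
// (int) (((double) readCount / totalExpected) * 100)
public class ProgressCalc {

    public static int progressPercent(long readCount, long totalExpected) {
        return (int) (((double) readCount / totalExpected) * 100);
    }
}
```

Note the cast to double before dividing: with integer division, any read count below the total would truncate to zero and the progress would sit at 0% until the very end.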

Together, these two endpoints provide a simple yet effective mechanism to not only start the contact import process but also monitor its progress in real time.

Running our Spring Batch Example tutorial

At this point, you should be able to build and run the example application. Attempt to run the application and if you encounter any issues you don’t understand, please compare your code with that in our repository.

If your application starts, attempt to call the initial endpoint to start the import process at: http://localhost:8080/importExample/start

After a moment, you should see the following in your browser.

[Screenshot: “Batch job has been invoked” response]

Now you should be able to check on the status of the job at the status URL: http://localhost:8080/importExample/status

Refreshing the page you should be able to see the values incrementing as the import job progresses.

[Screenshot: /status response showing job metrics]

Ultimately resulting in a “COMPLETED” status:

[Screenshot: /status response with “COMPLETED” status]

Conclusion

Throughout this tutorial, we’ve walked you through a practical spring batch example, demonstrating the steps to set up a bulk contact importer. We’ve dived deep into the configuration, explored the intricacies of the batch process, and learned how to keep track of its progress. Implementing such a solution from scratch would have been tedious and error-prone. But, with Spring Batch, we’ve efficiently streamlined the process.

Whether you’re aiming to integrate this into your existing project or simply trying to grasp the concepts, we hope this guide has been enlightening. We encourage you to adapt and expand upon this foundation, tailoring it to your specific needs.

We hope this Spring Batch example tutorial has proven useful. As a reminder, you can find the full source code in our repository, and please do not hesitate to contact us if you require any assistance with your Spring Boot development requirements.

