
Serialization and Deserialization Issues in Spring REST


Mosaic pattern

Photo by Annie Spratt

Spring Boot projects primarily use the JSON library Jackson to serialize and deserialize objects. It is especially useful that Jackson automatically serializes objects returned from REST controllers and deserializes complex parameters, such as those annotated with @RequestBody.

In a Spring Boot project the automatically registered MappingJackson2HttpMessageConverter is usually enough and makes JSON conversions simple, but some cases call for custom configuration. Let’s go over a few good practices for those.

Configuring a Custom Jackson ObjectMapper

In Spring REST projects a custom subclass of MappingJackson2HttpMessageConverter is a convenient place to create a custom ObjectMapper, as seen below. Whatever customization the ObjectMapper needs can then be handled by this custom converter:

public class CustomHttpMessageConverter extends MappingJackson2HttpMessageConverter {

    private ObjectMapper initCustomObjectMapper() {
        ObjectMapper customObjectMapper = new ObjectMapper();
        return customObjectMapper;
    }

    // ...
}

Additionally, some MappingJackson2HttpMessageConverter methods, such as writeInternal, can be useful to override in certain cases. I’ll give a few examples in this article.

In Spring Boot you also need to register the custom MappingJackson2HttpMessageConverter as a bean, like below:

@Bean
MappingJackson2HttpMessageConverter mappingJackson2HttpMessageConverter() {
    return new CustomHttpMessageConverter();
}

Serialization

Pretty-printing

Pretty-printing in Jackson is disabled by default. Enabling SerializationFeature.INDENT_OUTPUT in the ObjectMapper configuration turns it on (as in the example below). Normally a custom ObjectMapper is not necessary just to set pretty-printing. In some cases, however, like one I ran into in a recent customer project, it is.

For example, you may want a URL parameter to enable pretty-printing per request. In that case a better option is to keep the default ObjectMapper of MappingJackson2HttpMessageConverter as is and use a separate, pretty-print-enabled ObjectMapper only when requested.

public class CustomHttpMessageConverter extends MappingJackson2HttpMessageConverter {

    private ObjectMapper initiatePrettyObjectMapper() {
        ObjectMapper customObjectMapper = new ObjectMapper();
        customObjectMapper.configure(SerializationFeature.INDENT_OUTPUT, true);

        // additional indentation for arrays
        DefaultPrettyPrinter pp = new DefaultPrettyPrinter();
        pp.indentArraysWith(new DefaultIndenter());
        customObjectMapper.setDefaultPrettyPrinter(pp);

        return customObjectMapper;
    }

}
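One way to wire up that URL parameter is to override writeInternal in the same converter and choose a mapper per request. This is only a minimal sketch, assuming a hypothetical pretty query parameter; the field, the parameter name, and the RequestContextHolder lookup are my own additions (the usual imports from org.springframework.web.context.request and java.lang.reflect apply):

    private final ObjectMapper prettyObjectMapper = initiatePrettyObjectMapper();

    @Override
    protected void writeInternal(Object object, Type type, HttpOutputMessage outputMessage)
            throws IOException, HttpMessageNotWritableException {
        // Assumption: a "?pretty=true" query parameter asks for indented output on this request.
        ServletRequestAttributes attrs =
                (ServletRequestAttributes) RequestContextHolder.getRequestAttributes();
        boolean pretty = attrs != null
                && "true".equals(attrs.getRequest().getParameter("pretty"));

        if (pretty) {
            // Serialize with the pretty-printing mapper built above.
            prettyObjectMapper.writeValue(outputMessage.getBody(), object);
        } else {
            // Otherwise let the converter use its default ObjectMapper.
            super.writeInternal(object, type, outputMessage);
        }
    }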

Conditionally Filtering the Fields

When serializing a response object you may need to include or ignore one or more fields depending on their values. Let’s assume a model class UserResponse like below.

Notice that we use @JsonIgnore, which completely excludes the annotated field from serialization. Conditional filtering is different: it is done with SimpleBeanPropertyFilter objects registered on the filter provider of the ObjectMapper. Also notice the @JsonFilter annotation on UserResponse, which tells the ObjectMapper which filter to apply during serialization.

@JsonFilter("userCodeFilter")
public class UserResponse {

    public Integer userId;
    public String username;
    public Integer code;

    @JsonIgnore
    public String status;

}

Here we add a filter called userCodeFilter to the custom ObjectMapper of CustomHttpMessageConverter. It includes the UserResponse class’s code field in the serialization only if its value is greater than 0. You can add multiple filters to the ObjectMapper for different models.

public class CustomHttpMessageConverter extends MappingJackson2HttpMessageConverter {

    private ObjectMapper initiatePrettyObjectMapper() {
        ObjectMapper customObjectMapper = new ObjectMapper();
        customObjectMapper.configure(SerializationFeature.INDENT_OUTPUT, true);

        // additional indentation for arrays
        DefaultPrettyPrinter pp = new DefaultPrettyPrinter();
        pp.indentArraysWith(new DefaultIndenter());
        customObjectMapper.setDefaultPrettyPrinter(pp);

        PropertyFilter userCodeFilter = new SimpleBeanPropertyFilter() {
            @Override
            public void serializeAsField(Object pojo, JsonGenerator jgen, SerializerProvider provider, PropertyWriter writer)
                    throws Exception {
                if (include(writer)) {
                    if (!writer.getName().equals("code")) {
                        writer.serializeAsField(pojo, jgen, provider);
                        return;
                    }
                    int intValue = ((UserResponse) pojo).code;
                    if (intValue > 0) {
                        writer.serializeAsField(pojo, jgen, provider);
                    }
                } else if (!jgen.canOmitFields()) {
                    writer.serializeAsOmittedField(pojo, jgen, provider);
                }
            }

            @Override
            protected boolean include(BeanPropertyWriter writer) {
                return true;
            }

            @Override
            protected boolean include(PropertyWriter writer) {
                return true;
            }
        };

        FilterProvider filters = new SimpleFilterProvider().addFilter("userCodeFilter", userCodeFilter);
        customObjectMapper.setFilterProvider(filters);

        return customObjectMapper;
    }

}

Deserialization

JSON String Parse Error Handling in Spring Boot

This one is a little tricky. Deserialization of a JSON @RequestBody object can cause parsing errors if the JSON is not well-formed. These errors are thrown at Jackson’s deserialization level, before the result is handed off to Spring, so Spring Boot doesn’t catch them.

Jackson’s deserialization maps JSON to POJOs and finally returns an object of the expected Java class. If the JSON is not well-formed, parsing cannot be done and MappingJackson2HttpMessageConverter internally throws a parsing error. Since this exception is not caught by Spring Boot and no object is returned, the REST controller never gets a chance to respond to the badly-formed JSON payload.

Here we can override the internal read method of MappingJackson2HttpMessageConverter, replace its readJavaType call with our own customReadJavaType method, and make it return an internal error value when deserialization fails to parse the JSON input, rather than throwing an exception that Spring Boot never sees or handles.

@Override
public Object read(Type type, @Nullable Class<?> contextClass, HttpInputMessage inputMessage)
        throws IOException, HttpMessageNotReadableException {

    objectMapper.enable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

    JavaType javaType = getJavaType(type, contextClass);
    return customReadJavaType(javaType, inputMessage);
}

private Object customReadJavaType(JavaType javaType, HttpInputMessage inputMessage) throws IOException {
    try {
        if (inputMessage instanceof MappingJacksonInputMessage) {
            Class<?> deserializationView = ((MappingJacksonInputMessage) inputMessage).getDeserializationView();
            if (deserializationView != null) {
                return this.objectMapper.readerWithView(deserializationView).forType(javaType).
                        readValue(inputMessage.getBody());
            }
        }
        return this.objectMapper.readValue(inputMessage.getBody(), javaType);
    }
    catch (InvalidDefinitionException ex) {
        //throw new HttpMessageConversionException("Type definition error: " + ex.getType(), ex);
        return "Type definition error";
    }
    catch (JsonProcessingException ex) {
        //throw new HttpMessageNotReadableException("JSON parse error: " + ex.getOriginalMessage(), ex, inputMessage);
        return "JSON parse error";
    }
}

This way errors occurring at the deserialization level are surfaced to Spring Boot: the controller, which expects a deserialized object, receives a String value instead, and that can be detected and translated into an exception handled by a @ControllerAdvice. This also makes it easier to catch JSON parsing errors without using any third-party JSON libraries like Gson.
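To sketch that last step (the endpoint, exception class, and handler here are hypothetical additions, not part of the converter code above), the controller can detect the sentinel String and throw an exception that a @ControllerAdvice handler turns into a clean 400 response:

@RestController
public class UserController {

    @PostMapping("/users")
    public ResponseEntity<?> createUser(@RequestBody Object body) {
        // The custom converter hands back a plain String when parsing failed.
        if (body instanceof String) {
            throw new MalformedJsonException((String) body);
        }
        // ... otherwise map `body` to the expected type and continue ...
        return ResponseEntity.ok().build();
    }
}

@RestControllerAdvice
class JsonErrorAdvice {

    @ExceptionHandler(MalformedJsonException.class)
    ResponseEntity<String> handleMalformedJson(MalformedJsonException ex) {
        // Translate the deserialization failure into an HTTP 400 for the client.
        return ResponseEntity.badRequest().body(ex.getMessage());
    }
}

class MalformedJsonException extends RuntimeException {
    MalformedJsonException(String message) {
        super(message);
    }
}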


What is SharePoint?


Web servers

Image by Taylor Vick

People often ask me about SharePoint, Microsoft’s browser-based collaboration platform which allows users to upload and share all kinds of documents, images, messages, and more. The product has nearly two decades of history and there are still many who don’t know much about it.

The SharePoint platform has grown over those years, and its capabilities have expanded to the point that it is often quickly dismissed from consideration out of fear of the complexity of its implementation and the cost of deployment. These fears may be unfounded, however. Especially if you are already on Office 365, SharePoint may be included in your plan.

SharePoint was designed as a framework to create and share content on the web without the need to write code. Its purpose was to allow everyone in the organization to collaborate without any specific programming skills. This framework grew over time, adding many different types of content and interactions with other frameworks, increasing the effectiveness of any organization’s work product, intellectual property, and communications.

Flavors of SharePoint

There are two ‘flavors’ of SharePoint. You can use Microsoft’s cloud-based service or you can host your own on-premises server farm. But I suspect Microsoft’s preference is to wrangle organizations into the cloud, as seen in Microsoft’s SharePoint 2019 online documentation which casually omits references to the on-premises server product. Microsoft offers an inexpensive per-user SharePoint cloud service license for those organizations that don’t want to use Office 365’s other offerings.

On the other hand, on-premises SharePoint Server licensing is very expensive, especially if you wish to design for high availability and create a well-balanced SharePoint server farm. This requires CALs (Client Access Licenses) as well. But the cloud licensing model is very attractive in pricing, especially if you are planning to move your organization’s Exchange email into the Office 365 offering because SharePoint licensing is included in the top two Business tiers and all the Enterprise licensing plans.

Intranets

Over the years I have helped many small- to medium-sized businesses create their intranets using both on-prem SharePoint servers and the SharePoint Online offering, mostly to leverage document management features and their content search capabilities. SharePoint is very good at indexing all Microsoft Office formats, data and metadata, allowing for the inclusion of custom extended tags that can be applied to files and folders to further categorize them and make them easy to organize and find. It also indexes content and metadata from readable PDF format.

Because the environment is highly customizable or “brandable”, companies quickly expand on its use once they are introduced to its basic capabilities. I’m often surprised by how creative non-technical staff can be as they come up with new ways to use the platform.

SharePoint is also a secure way to share documents on a variety of devices, including mobile, via the web, leveraging Active Directory or SAML-compliant single sign-on (SSO) services like Okta, OneLogin, or Duo for authentication. The framework has its own content permission group capabilities that are simple to manage without giving content managers access to AD or auth servers. This framework is attractive because it provides, without much training, the ability for employees to create and share content with granular permissions, manage data and custom lists, and create individual web pages or entire sites within the portal.

SharePoint Sites

Let’s discuss SharePoint Sites. SharePoint allows users with permissions to create individual pages or entire “Team Sites” to organize and secure content with permissions that the site creator can define. The ability to create content and assign these permissions within the organization or with external partners or customers can be delegated by an administrator to team leaders who wish to control their own content. However, global administrators retain the control to secure the company’s data and intellectual property via built-in tools, policies, auditing, and alerts. There is also a reporting system for compliance reporting.

Sites can contain an assortment of content types. Team managers can simply pick from a list of available components, including document and photo libraries, and custom lists that are created much like Excel tables to hold data for distribution, such as employee directories, product lists, and inventory items. Sites can also contain other shared resources such as Wikis, calendars, tasks, issue tracking, and OneNote notebooks. Sites can contain components you can create yourself with Visual Studio leveraging the API. Each site can also hold its own set of usage statistics and workflow management for teams to optimize and hone both performance and effectiveness of the data shared. In my experience sales, finance, and HR teams benefit the most from these ready-made components, but all manner of teams can find useful tools on this platform.

Power Apps, Power Automate (Flow)

Each SharePoint release adds to its predecessor’s core functionality. One such example is “Power Apps”, an add-on service that lets external data sources interact with the built-in list and library components without coding, by establishing connectors and building mobile-friendly content pages.

Another called “Power Automate” (formerly known as “Flow”) can reduce repetitive tasks by using a simple visual designer for scripting actions that respond to triggered events. This varied set of tools also integrates seamlessly with not just the base Office products like Word, Excel, and PowerPoint, but also with Teams and OneNote and even other major services like Exchange and Dynamics. This increases the collaboration options for a mobile workforce regardless of the device or OS they use. This is a very powerful tool for organizations needing to collaborate across different platforms and around the world.

Mobile

There are currently free native mobile apps for iOS and Android that provide access to SharePoint content. This adds a layer of security, along with compartmentalization for Mobile Device Management (MDM) systems that run business content in a separate, encrypted container on these devices. Such systems include IBM’s MaaS360, SOTI MobiControl, Citrix XenMobile, AirWatch by VMware, and Microsoft’s own Intune.

With all these out-of-the-box capabilities, SharePoint is perfect for intranet portals and group sites that can be created and managed without any programming experience or knowledge. But if you have the skill set, the entire framework and all its capabilities are built upon a well-documented and time-tested API, which you can extend with web components created in Visual Studio, Microsoft’s own IDE (Integrated Development Environment), in several programming languages, allowing integration with other Microsoft API frameworks like MS SQL Server, Exchange, and Dynamics.

Keeping up with content changes

All this functionality and capability sometimes intimidates businesses; it would seem that you could easily overwhelm your users. To combat that, the whole system, at every level, allows users to “follow” content. This functionality has been in the platform since its inception and is one of my favorite features. For every page, site, or component at any level, users have the ability to subscribe to content. With one click you can follow any document and be notified on your portal homepage or via notifications of actions taken or content changed. This means that you can allow people to be as connected as they feel they need to be. Nobody needs to suffer from information overload or out-of-control mobile notifications.

We can help!

If you are already a Microsoft Office 365 subscriber, odds are you already have access to this incredible tool at no additional cost. How can you leverage this for your business? Contact us at End Point for additional information.

Web Projects for a Rainy Day


raindrops on a plant

Image by Yellowstone NPS on Flickr

With the COVID-19 quarantine disrupting life for many of us, I thought I’d put together a list of things you can do with your website on a rainy day. These are things to keep your business moving even if you’re at home and some of your projects are stuck waiting on things to reopen. If you’re looking for some useful things to do to fill your days over the next few months, this post is for you!

Major Version Updates

Make a list of your entire stack, from OS to database to development frameworks. Note the current version and research the current supported versions. I find Wikipedia pages to be fairly reliable for this (e.g. https://en.wikipedia.org/wiki/CentOS). Ok, so what things need to be updated, or will need to be in the next year? Start on those now and use some downtime to get ahead of your updates.

Sample of a client’s stack review

| Software | Purpose | Our version | Release date | End of support | Next update | Newest version | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CentOS | OS for e-commerce server | 7 | July 2014 | June 2024 | Not imminent | 8 | https://wiki.centos.org/About/Product |
| Nginx | Web server | 1.16.0 | March 2020 | Unclear | Not imminent | 1.16.1 | https://nginx.org/ |
| PostgreSQL | Database server | 9.5.20 | January 2016 | Feb 2020 | Medium term, to version 11 | 12 | https://www.postgresql.org/support/versioning/ |
| Rails | App framework for store | 5.1 | February 2017 | Current | Long Term, to version 6 | 6 | https://rubygems.org/gems/spree/versions |
| Elasticsearch | Search platform for product import/search | 5.6.x | September 2017 | March 2019 | Immediate, to version 6.8 | 7.4 | https://www.elastic.co/support/eol |
| WordPress | Info site | 5.2.3 | September 2019 | Unclear | 5.2.4 shipped recently | 5.2 | https://codex.wordpress.org/Supported_Versions |

Content Cleanup & SEO Review

Everyone’s website gets cluttered with outdated content. Take a look at your pages, review, and update what needs to be changed. Pay attention to search engine optimization (SEO) concerns as you go through it. Make sure your content has headers, accurate keywords, and good meta-descriptions. Research SEO best practices if you need a refresher.

Nowadays, reducing repeated content has huge benefits for SEO so we recommend any content review includes a review of duplication. If you have a small site, you can go through your content and SEO manually. Larger projects can utilize tools on the market such as Siteliner or WPOptimize.

While you’re taking a dive into content, don’t forget to review your Google Analytics and understand what content is being used and what isn’t. Google has added many new features to Analytics and Ads, so it’s a good idea to refresh yourself on the updated documentation and new features.

Reporting

A lot of clients with big ecommerce data sets or other applications that collect data benefit from a separate reporting or business analytics tool. A rainy day can be a good time to think about what reports you want on last year’s business and what data will help you plan for the future. End Point has worked with a few different reporting tools that easily add on to your database, like Pentaho or Jasper, and those can be really useful.

Documentation

I wouldn’t be a good project manager if I didn’t throw this one in the list. Documentation is so, so important, yet really we can always do more. End Point uses a few different tools, including wikis running MediaWiki and Google Docs, for keeping track of project details. Now’s a good time to set up a nice documentation system or do a big review and make sure everything is updated and back in order. Maybe dream of a vacation you might be able to take when this is over and make sure everything’s ready for you to do that.

Disaster Recovery Tests

For anyone with business-critical infrastructure, you need to ensure you know how to get everything back up and running in the case of a major failure, either with on-premises or cloud hosting. Now’s a good time to clarify with your hosting vendor things like: What are your backups like? What is your disaster recovery plan? What is the timeline for recovering the application in the event of a major failure? If you can, take time to do a simulation and make sure all the pieces are there if they need to be. Simply said, we also need to test our backups in order to ensure that they actually work.

Redesign

If you’ve been meaning to refresh your website, a rainy day is prime time to do it. Designers and developers are looking for projects and you’ll have extra time on your hands to oversee the process, spend time reviewing and testing, and get things done just the way you want.

Automated Testing

Good developers want an automated test suite as part of their application. Not all applications were built with this from the beginning and many didn’t have the budget or time to get it done. With extra time on your hands, this can be a great time to start building your test suite or to improve the coverage of your existing one.

Unit tests in particular are a good place to start. Unit tests are great not only because they help validate software correctness and protect against regressions, but also because they require a well-factored and modular system. This means that, while writing your unit tests, you will often be forced to go back to your application’s code base and refactor it to make it testable, to make it better. Investing in a solid unit test suite gives you great bang for your buck. You can also look at implementing continuous integration: a pipeline that lets multiple developers deploy code throughout the day, with your automated tests built into the workflow.
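As a tiny illustration of the idea (using only Node’s built-in assert module, no particular test framework, and a made-up function), a unit test is just code that exercises a small, isolated piece of your code and states what it expects:

const assert = require('assert');

// A small, pure function is easy to test in isolation.
function applyDiscount(total, percent) {
    return total - total * (percent / 100);
}

// The test documents expected behavior and guards against regressions.
assert.strictEqual(applyDiscount(100, 15), 85);
console.log('All tests passed');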

Versioning & Deployment Tools

When you’re cleaning house, take a look at your Git version control repository and make sure everything important is in there. We have a few projects that have a main project in Git, but sometimes the smaller projects and one-offs can go astray. This is a good time to get everything organized into one repository, or make sure external repositories are connected and integrated.

Automated DevOps deployment tools can also be nice to work on. Tools like Ansible and Chef can take a lot of time to set up and test, but they have some great time-saving and accuracy advantages down the line. Our in-house security experts also recommend tools like AIDE and OSSEC, which automate daily monitoring of file changes.

Security Audit and Monitoring

Taking a few days to review your personal security and that of your application is something you should do regularly and now’s a good time to plan for it. Charlie’s got a great security post that’s a good top-level review. For application security, End Point uses some tools for vulnerability scanning. We also have a checklist of basic security items that include password handling, PII data, and other common security holes. For certain projects/clients we must also take HIPAA or PCI DSS compliance into account. Also, don’t neglect to review your TLS status and ensure that web applications run on TLS 1.2 and are TLS 1.3 ready. This also may relate to the underlying operating systems—whether they are able to support the latest TLS version natively.

Optimization and Performance

Most of the time new features have higher priority than improving the performance of an existing system. This could be the right time to review core functionality and list the areas where optimization would give customers a better experience: code, database queries, image sizes, data compression over the network, caching, a CDN, and so on. We’ve been moving quite a few clients to the Cloudflare DNS and CDN service and have been really happy with it. Optimization work also influences customer retention, which helps increase profitability over the long term.

Refactoring

Along the same lines as optimization, code refactoring can have long-term gains in performance and ease of future development. Think of it like house cleaning: it is always easier to find an item in the house when things are arranged in an orderly manner. Similarly, an organized, clean code base plays a vital role in future changes and development, reducing the chance of unexpected bugs, saving time when making changes, and improving code readability. Disciplined refactoring delivers readable, reusable, non-redundant code. Refactoring can be applied to your databases and user interfaces as well.

Want to get started on some background projects for your website? Talk to us today.

An Introduction to webpack 4: Setting Up a Modern, Modular JavaScript Front-End Application


Banner

Image taken from https://webpack.js.org/

I’ve got a confession to make: Even though I’ve developed many JavaScript-heavy, client side projects with complex build pipelines, I’ve always been somewhat confused by the engine that drives these pipelines under the hood: webpack.

Up until now, when it came to setting up a build system for front-end development, I always deferred to some framework’s default setup or some recipes discovered after some Googling or StackOverflow-ing. I never really understood webpack at a level where I felt comfortable reading, understanding and modifying a config file.

This “learn enough to be effective” approach has served me well so far and it works great for being able to get something working, while also spending time efficiently. When everything works as it should, that is. This approach starts to fall apart when weird, more obscure issues pop up and you don’t know enough about the underlying system concepts to get a good idea of what could’ve gone wrong. Which can sometimes lead to frustrating Googling sessions accompanied with a healthy dose of trial and error. Ask me how I know...

Well, all that ends today. I’ve decided to go back to basics with webpack and learn about the underlying concepts, components and basic configuration. Spoiler alert: it’s all super simple stuff.

Let’s dive in.

The problem that webpack solves

webpack is a module bundler. That means that its main purpose is taking a bunch of disparate files and “bundling” them together into single, aggregated files. Why would we want to do this? Well, for one, to be able to write code that’s modular.

Writing modular code is not as easy in JavaScript that runs in a browser as it is in other languages or environments. Traditionally, the way to achieve good modularity in the web front-end has been via including separate scripts via multiple <script> tags within HTML files. This approach comes with its own host of problems. Things like the order in which the scripts are included suddenly matter, because the browser executes them top to bottom, which means that you have to be very careful to include them in an order where dependencies of the later files are included first. Also, this approach encourages the pollution of the global scope, where every script declares some global variables which are then used by other scripts. This is problematic because it is not clear which scripts depend on which ones. In other words, dependencies become implicit and hard to track. Unit testing becomes harder as well, since you need to mock these dependencies at a global scope. You also run the risk of some scripts overriding these global resources anytime. It’s just not very clean.

Over the years, solutions have been devised by the community to tackle this issue, which have taken the form of module loaders like CommonJS and AMD. Finally, ES2015 introduced a native module loading system into the language itself via the import and export statements. The life of JavaScript in the browser is not so easy though, as these new specifications take time to implement by the various browser vendors. This is where webpack comes in. Part of its offerings is the ability to use this standard, modern module loading mechanism without having to worry about browser compatibility. This unlocks the potential for front-end developers to write modern, beautiful, modularized JavaScript. This is huge.

Concepts: Entry points and output

Now, what does that look like in webpack terms? Well, we define an entry point, an output specification, and let webpack do its thing.

This is a good time to formally introduce two of webpack’s core concepts. First, we have entry points.

Entry points represent the starting point of execution for your app or page, and as such, serve as the starting file for webpack to begin building up your bundle. These are normally JavaScript files that bootstrap the execution of whatever application or page that you are developing. From a software design perspective, they tend to import many other modules and start up the scripts and instantiate the objects that actually run the application logic.

So, for example, say you have a file named index.js, which has some front-end logic and uses some utility classes living in separate files like SomeServiceClass.js or ApiClient.js. This index.js is a great candidate for being an entry point for your application or page because it is the one singular file that calls upon all of the other dependencies/modules.

We also have output. Output is the result of a webpack bundling operation. When webpack takes our entry points, it builds and compiles their corresponding dependencies and produces a single JavaScript file that can be directly included into our page or app via a <script> tag. This file is our output. This is the only file that needs to be included in the final page, because webpack took all the separate dependencies and bundled them together in one single package.

Introducing our demo app and its problems

But let me show rather than tell. Consider a simple calculator application, whose source code file structure looks like this:

.
├── index.html
└── js
    ├── calculator
    │   ├── calculator.js
    │   └── operations
    │       ├── addition.js
    │       ├── division.js
    │       ├── multiplication.js
    │       ├── operation.js
    │       └── subtraction.js
    └── index.js

You can explore the source code for this small application here. Feel free to go ahead and download it if you want to work along with me.

I’ve called the app the “Automatic Calculator” (not to be confused with its much less powerful alternative, the “manual” calculator!) and you will be able to figure out its general architecture pretty quickly.

In the root directory, we’ve got index.html which contains the GUI for our app. Then, all the behavior is inside the js directory.

index.html is pretty straightforward. It’s got a simple form with two fields for typing in numbers, and a button that will automatically run a few arithmetic operations on those numbers. The results are presented right there in the page just a few pixels below the form.

For the purposes of this post, the interesting part of that file comes near the bottom, where we include all of our JavaScript logic. It looks like this:

<script src="js/calculator/operations/operation.js"></script>
<script src="js/calculator/operations/addition.js"></script>
<script src="js/calculator/operations/subtraction.js"></script>
<script src="js/calculator/operations/multiplication.js"></script>
<script src="js/calculator/operations/division.js"></script>
<script src="js/calculator/calculator.js"></script>
<script src="js/index.js"></script>

As you can see, our little Automatic Calculator logic is separated into a series of files. And right now we start seeing some of the drawbacks of not using any sort of module loading for our app. This page has to include all of the script files that it needs separately. What’s worse, the order in which they are included matters. For example, since the js/calculator/calculator.js file depends on the js/calculator/operations/multiplication.js file to work, multiplication.js needs to be included first. Otherwise, the page will break. From this page’s perspective, it would be much easier and cleaner if it could just include one file, one “bundle” that has everything it needs.

If we look at our script files, we see more related problems. Consider js/index.js, for example. This is the file that starts up the app. It defines an App class which it then instantiates and runs. Here’s what it looks like (explore the source code in the git repo if you want to see the whole thing):

class App {
    constructor() {
        this.calculator = new Calculator();

        /* Omitted */
    }

    run() {
        /* Omitted */
    }
}

console.log("Starting application");

let app = new App();
app.run();

console.log("Application started");

The App class’ constructor is creating a new instance of the Calculator class. The problem is that, from the point of view of the reader of this file, it’s doing so out of thin air. There’s no indication whatsoever that this file depends on and uses the Calculator class. It is available here only because the file that contains that class happens to be included in a <script> tag in the same page that is using js/index.js. That’s hard to read and maintain as the dependency is implicit in a place where it should be explicit. Imagine if js/index.js was a couple hundred lines bigger and had a few dozen more dependencies. That’d be very hard to manage. You would have to read the entire file to get an idea of how the code is structured. To be able to reason about the code from a higher level, you would have to go to the really low level. Too much cognitive overhead.

The same thing happens in the js/calculator/calculator.js file. It defines the Calculator class, and that class depends on other classes (Addition, Subtraction, etc.) which are also defined in other JavaScript files, but js/calculator/calculator.js never explicitly says that it needs those dependencies. They are called up out of thin air. Everyone is trusting index.html to include all the separate script files where these classes are defined, and to do so in the proper order. Too much responsibility for little old index.html. And what would happen if somebody wanted to reuse the Calculator class in another page, for instance? That developer would have to know all of the dependencies of that class and include them manually in the new page. What if the file were much bigger, with more dependencies? That can get really ugly really quickly.

Luckily for us, those are exactly the kinds of problems that webpack helps us deal with. So let’s start refactoring this little app so that it can take advantage of a webpack-based build process.

Installing webpack into our project

If you are following along and downloaded the source code from the GitHub repo, note that whenever I say “root”, I mean that repo’s original directory. That’s where the original version of the Automatic Calculator app’s source code lives as it is before introducing webpack. You can work directly from there or take the contents of that directory and put them wherever it is most comfortable to you. The final directory contains the source code as it will be by the end of this post.

The first thing we need to do is install webpack into our project. That’s super easy if you already have Node.js and NPM installed. How to install Node.js and NPM is out of the scope of this discussion, so I would recommend following Node.js’s documentation to get them installed.

Once you have that, go to our project’s root and run npm init -y. This will create a package.json file with some default configuration. This effectively makes our code a proper Node.js project.
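The generated file will look roughly like this (the name and defaults depend on your directory and npm version):

{
  "name": "automatic-calculator",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}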

After that, installing webpack is as easy as going to our project’s root and running:

npm install webpack webpack-cli --save-dev

That will create a new node_modules directory and install both the webpack and webpack-cli packages as development dependencies. They are installed as dev dependencies because, in production, our app won’t actually need webpack to run properly. webpack will only help us build the deployment assets for our app, and that happens at development time.

The webpack configuration file

Now, we need to create a webpack.config.js file which tells webpack how to take our files and produce said deployment assets, i.e. the compiled bundle. Here’s what a simple config file tailored to our app would look like.

const path = require('path');

module.exports = {
    mode: "development",
    entry: "./src/index.js",
    output: {
        filename: "index.js",
        path: path.resolve(__dirname, "dist"),
    },
};

Let’s discuss it line by line.

First, there’s const path = require('path'); which is just including the Node.js native path package that the config file uses further down below.

Then there’s the module.exports definition. Conveniently, webpack’s configuration values are organized within a JavaScript object. Setting module.exports equal to that object makes it available to other JavaScript files that import this file.

Now we have the actual webpack build settings. The mode field can be either development, production, or none. Not super important for us right now, but depending on what you set here, webpack will apply different optimizations to the bundles.

The entry field defines the entry point for the build process. As discussed before, this is the file that webpack will start from when figuring out the entire dependency graph. That is, all of the files that specify that they need one another to work via import and export statements (more on that later). In our app, the dependency graph looks like this:

[Diagram: the dependency graph of the Automatic Calculator app]

In other words, index.js depends on calculator.js. calculator.js in turn depends on addition.js, subtraction.js, multiplication.js, and division.js. Finally, all of the latter four depend on operation.js. Here, we have specified ./src/index.js as our entry point. But wait, our index.js file lives inside our js dir. What gives? We’ll change that soon, when we prepare our files to use ES2015 modules and have an overall more conventional organization, as far as webpack is concerned.

Finally, with the output field, we tell webpack what we want it to produce after bundling together all those files. In this case, we’ve configured it to produce a file called index.js inside the dist directory.

Refactoring our project to use ES2015 modules

Like I discussed before, webpack allows us to express our dependencies using the now-standard import and export statements, which are natively supported as of JavaScript's language level ES2015. Let’s do that. This is super easy, and you will understand it immediately if you have used any other language that supports modules like C#, Java, Python, PHP... Yeah, pretty much any other language out there BUT JavaScript.

Anyway, we have to add this line at the beginning of index.js:

import Calculator from "./calculator/calculator.js";

Nice, this is an explicit dependency declaration for our index.js file. Now webpack (and readers of our code!) can know that this file uses the Calculator class. And like that, we go file by file adding dependencies. In calculator.js, we add:

import Addition from "./operations/addition.js";
import Subtraction from "./operations/subtraction.js";
import Multiplication from "./operations/multiplication.js";
import Division from "./operations/division.js";

In all of those four files, we add:

import Operation from "./operation.js";

It’s all pretty self-explanatory. The syntax is import <CLASS_NAME> from "<RELATIVE_FILE_PATH>"; where the path is relative to the location of the file in which we’re adding the import statement.

Now, import statements are for specifying which other files a given file needs to work. To allow the code elements that are defined within a file to be imported by others though, they need to be exported. We do that by annotating the code elements that we want to make available with the export statement. export basically specifies what parts of a module are available for others to use.

Our code is factored in a very simple way, where the only thing defined in a given file is a class. So, in order for all the import statements that we added to work, we just need to go file by file making changes like this:

-class Calculator {
+export default class Calculator {

All we did was add the export and default keywords to the definition of the Calculator class in calculator.js. export makes it so we are allowed to import that class elsewhere, and default allows us to use the style of import that we used before: the import Calculator from "./calculator/calculator.js"; one.

By the way, yes, there are other styles of import and the default keyword is optional. To learn more about import and export statements and JavaScript modules in general, I’d recommend a few resources from MDN: modules, import, and export.
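For example (this file is not part of our calculator, just an illustration), a module can also expose several named exports, which are then imported by name inside curly braces:

// mathUtils.js: named exports instead of a single default export
export const PI = 3.14159;
export function square(x) {
    return x * x;
}

// consumer.js: pick the members you need by name
import { PI, square } from "./mathUtils.js";
console.log(square(PI));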

Alright, so after adding export default to all of our class definitions (except for the App class defined in index.js, we don’t need to export it because we don’t import it anywhere) we have almost all the pieces of the puzzle ready for our webpack build to work.

The only thing left is to rename our js directory to src to match our webpack configuration. There’s no particular reason for using src instead of js, other than it being a good way to clearly separate the location of the source code from the location of the compiled assets, which will live in dist.

Anyway, with that rename done, you can go ahead and run

npx webpack --config webpack.config.js

As a result of that command, you should see a new dist directory with a single index.js file in it. Just like we specified as the output in our webpack configuration file.

And that’s it! We have set up a webpack build process for our app. Don’t forget to change the index.html page so that it only includes this new compiled asset. That should look something like this, at the end of the <body>:

<script src="dist/index.js"></script>

You should be able to open up the page in your browser and marvel at what we’ve accomplished:

[Screenshot: the Automatic Calculator page running in the browser]

The other problem that webpack solves

Ok, we’ve made a great accomplishment here. We’ve managed to leverage webpack to write modular JavaScript. The importance of this cannot be overstated. However, webpack offers us much more. Another problem that has historically plagued front-end web development is compatibility across multiple browsers. webpack helps us with that as well via loaders.

Concept: Loaders

In a nutshell, loaders are components that webpack uses to transform source code files during its bundling process, in order to make them available to the app. They basically allow webpack to turn things other than JavaScript into modules that can be used by the application via import. By itself, webpack only knows how to process JavaScript and JSON files. Using loaders, however, webpack can also process other types of files like CSS and its various flavors like LESS or SASS, or even JavaScript derived languages like TypeScript or CoffeeScript.

Writing modern JavaScript with Babel

Here’s where the browser compatibility solution that I alluded to earlier comes into play: babel-loader. This loader makes it so we can write bleeding edge JavaScript, with all the newest features of the language, without having to worry if our target browsers support them.

Babel is a compiler that takes code written with later versions of the language and turns it into backwards-compatible JavaScript that can run in older browsers. babel-loader is how Babel integrates with webpack. I like being able to take advantage of JavaScript’s latest features, so, for me, setting up Babel in any new project is a must. Luckily for us, with webpack, setting it up is easy.

You can learn more about Babel and the latest JavaScript features here and here.

Like most JavaScript packages, Babel is distributed as an NPM package. So let’s go ahead and install it with

npm install --save-dev babel-loader @babel/core @babel/preset-env

This will install the core Babel package as well as the loader for webpack, and the env preset.

The concept of a “preset” is actually a Babel-related one, not really having anything to do with webpack. You can learn more about them here, but, for our purposes, suffice it to say that Babel presets are a specification of which features are available to use. There are many presets (i.e. feature sets) against which you can configure Babel to compile. env is just a very convenient one that provides support for all the latest language features.

Again, we’re installing these as dev-only dependencies because they are not needed for the app at runtime, only at build time.

Now, we go to our webpack.config.js file and add these lines:

module.exports = {
    // ...
    module: {
        rules: [
            { test: /\.js$/, exclude: /node_modules/, loader: "babel-loader" },
        ]
    }
};

This is how we specify that we want babel-loader to transform our JavaScript files before webpack bundles them together. Inside module.rules, there’s an array of objects. Each of the elements of that array is a rule specifying on one hand which files it applies to, via the regular expression in the test field, and on the other hand which loader will be used to process them via the loader field. The exclude field makes sure that files under our node_modules directory are not affected by babel-loader. We don’t want to be transforming the packages we downloaded from NPM after all, those are ready to use as they are. We only want to transform our own code.

In summary, this rule tells webpack to “run all .js files by babel-loader except for those inside node_modules”. babel-loader will make sure to transform them into plain old JavaScript before giving them back to webpack for bundling.

Finally, Babel itself requires a little bit of configuration. So let’s give it what it needs by creating a .babelrc file in our project’s root with these contents:

{
    "presets": ["@babel/preset-env"]
}

Pretty self-explanatory. This configuration tells Babel to use the env preset when processing our files.

Now run npx webpack --config webpack.config.js and hope for the best. Just kidding! Everything should work like a charm and with that, we have just unlocked JavaScript’s full potential for our project. Open the page again and you will see nothing has changed. What we have gained is the ability to write modern JavaScript without having to worry about compatibility.

Bonus: Multiple entry points

When building an SPA, a single entry point is appropriate. However, we often have apps with several pages, each one of which is a sort of advanced front-end application in its own right. From a webpack perspective, apps like that will have multiple entry points, one for each page. This is because each page has its own set of client side code that runs independently from the other pages. They may share common parts under the hood (via class or library reuse, for example) but the main, initial script is different.

Let’s consider our calculator. So far, that app has only one page. Imagine we want to add a new one. Let’s say, an admin control panel. To make that happen, let’s add new admin.html and src/admin.js files to the project. Their contents don’t matter. All I need is for them to exist in order to illustrate the multiple entry points capability.

As far as webpack is concerned, we can configure the build process to support that style of application by updating our webpack.config.js like so:

 module.exports = {
     mode: "development",
-    entry: "./src/index.js",
+    entry : {
+        index: "./src/index.js",
+        admin: "./src/admin.js"
+    },
     output: {
-        filename: "index.js",
+        filename: "[name].js",
         path: path.resolve(__dirname, "dist"),
     },
 }

As you can see, we’ve changed our config object’s entry field to include two separate entry points, one for our existing index.js entry point and another for the new admin.js one. We’ve also tweaked the output configuration; instead of a static name for the resulting output bundle, we use a pattern to describe them. In this case, [name].js makes it so we end up with two bundle files named after their corresponding entry points. The [name] part is where the magic happens. When creating the compiled bundle files, webpack knows to substitute that variable with the corresponding value from the entry point configuration.

Go ahead and run npx webpack --config webpack.config.js again and inspect the dist directory. You will see that we now have two bundles: index.js and admin.js.

dist/
├── admin.js
└── index.js

The new admin.js file can be added to the new admin.html page like usual via a <script> tag, just like we did with index.html.
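That is, near the end of admin.html’s <body> you would include something like:

<script src="dist/admin.js"></script>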

Bonus: CSS can also be a module

When discussing loaders, I mentioned the possibility of making things other than JavaScript behave like modules and be available to the application. Let’s demonstrate how we can make that happen.

First, let’s install the loaders that we need with:

npm install --save-dev css-loader style-loader

Then, create a new CSS file under /css/index.css with some fabulous styling:

body {
    background-color: aquamarine;
}

button, input {
    border: grey solid 1px;
    background-color: white;
    padding: 5px;
}

input:hover, button:hover {
    background-color: teal;
}

Now import it into our index.js file with a line like this near the top of the file:

import "../css/index.css";

And finally, let’s configure webpack so that it knows what to do when it finds this weird import css line. We do so by updating our module.rules config:

 module: {
     rules: [
         { test: /\.js$/, exclude: /node_modules/, loader: "babel-loader" },
+        { test: /\.css$/, use: ['style-loader', 'css-loader'] }
     ]
 }

What we did here was add the new loaders to the webpack build pipeline. With this, it now knows how to handle CSS files and apply whatever styling rules are defined in that CSS to the page in question. Pretty neat trick, huh?

These two particular loaders are doing more than meets the eye. You can learn more about them here and here.

Further reading: Plugins

There’s one core concept that we haven’t discussed yet: plugins. Plugins offer yet another avenue for customizing webpack’s behavior. I won’t go into too much detail on them here because I think that with the understanding we have gained on entry points, outputs, and loaders, we’ve added really powerful tools into our toolboxes. These tools will be enough in most cases, or at least allow us to get new projects up and running, but with more insight into what’s actually happening under webpack’s hood. If you are so inclined, you can learn more about them here. In a nutshell, plugins are for “doing anything else that a loader cannot do”, as webpack’s own documentation puts it.
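Just to give a taste (this is not part of our calculator setup; the popular html-webpack-plugin package would have to be installed separately with npm install --save-dev html-webpack-plugin), plugins are added to the configuration through a plugins array:

const HtmlWebpackPlugin = require('html-webpack-plugin');

module.exports = {
    // ...
    plugins: [
        // Generates an HTML file in dist/ that already includes the compiled bundles.
        new HtmlWebpackPlugin({ template: "./index.html" })
    ]
};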

~

And that’s it for now. Thanks for joining me in exploring webpack, a pretty neat piece of software that makes many a front-end developer’s life easier, mine included.

Salesforce Integration with Node.js


Patterned roof

Photo by Dylan Wooters, 2020

Salesforce is huge. It is currently the dominant customer relationship management (CRM) provider, accounting for around 20% of market share. Businesses are using Salesforce not only as a traditional CRM solution, but also for novel purposes. Salesforce can serve as a backend database and admin portal for custom apps, or as a reporting tool that pulls data from various systems.

This growth leads to increasing demand for Salesforce integrations. The term “Salesforce integration” may conjure up images of expensive enterprise software or dense API documentation, but it doesn’t have to be that way. You can work with Salesforce easily using Node.js and the npm package JSforce. An example of a project that might benefit from this kind of Node.js integration is an e-commerce website where order data is loaded to and from Salesforce for order fulfillment, tracking, and reporting.

In this post we’ll cover how to connect to Salesforce using JSforce, the basics of reading and writing data, as well as some advanced topics like working with large amounts of data and streaming data with Socket.IO.

Setting Up

You’ll first want to install Node.js on your local machine, if you haven’t done so already.

Next, create your Node app. This will vary with your requirements. I often use Express to build a REST API for integration purposes. Other times, if I am routinely loading data into Salesforce, I will create Node scripts and schedule them using cron. For the purposes of this post, we will create a small Node script that can be run on the command line.

Create a new directory for your project, and within that directory, run npm init to generate your package.json file. Then install JSforce with npm install jsforce.

Finally, create a file named script.js, which we will run on the command line for testing. To test the script at any time, simply navigate to your app’s directory and run node script.js.

At the top of the script, require jsforce, as well as the Node IO libraries fs and path. Then define an asynchronous function that will serve as our script body. This is where all of your Salesforce code will go.

var jsforce = require('jsforce');
var fs = require('fs');
var path = require('path');

run();
async function run(){
   //salesforce code goes here...
}

Connecting to Salesforce

I usually store my Salesforce credentials and instance URL as a JSON object in a separate file, which I gitignore. This ensures that sensitive data does not appear in Git. Below is the content of my salesforce-creds.json file. You’ll want to add your Salesforce username and password and update the instance URL, if necessary.

{
   "username": [your username],
   "password": [your password],
   "url": "https://na111.salesforce.com"
}

To connect to Salesforce simply retrieve the credentials from the file and use them with the JSforce Connection class to login. Be sure to wrap all JSforce code in a try-catch block, to catch any errors coming back from Salesforce.

let creds = JSON.parse(fs.readFileSync(path.resolve(__dirname,'./salesforce-creds.json')).toString());
let conn = new jsforce.Connection({ loginUrl : creds.url });
try {
   await conn.login(creds.username, creds.password);
   console.log('Connected to Salesforce!');
   //now you can use conn to read/write data...
   await conn.logout();
} catch (err) {
   console.error(err);
}

Reading, Writing, and Deleting Data

Once connected, the easiest way to query data from Salesforce is to use the JSforce query function, and pass in an SOQL statement. This offers the most flexibility, as you can run queries for child and parent objects. Using SOQL, we can query all accounts and their contacts (children) in a single statement. Note, however, that there are limitations on relationship queries. You can only go down one level, from parent to child, but you can go up multiple levels from child to parent.

Writing and deleting data is simple with JSforce using the sobject class and the corresponding create/update/delete function. In the example below, we will query for accounts and contacts using SOQL, and then isolate and update a specific contact using sobject().update.

let soql = `select id, name,
    (SELECT Id, FirstName, LastName, Email_Verified__c, Enrollment_Status__c from Contacts)
    FROM Account`;
let accounts = await conn.query(soql);
let cooper = accounts.records
    .filter(x => x.Name === 'Twin Peaks Sheriff Dept.')[0].Contacts.records
    .filter(y => y.FirstName === 'Dale' && y.LastName === 'Cooper')[0];
console.log(cooper);
//Console output:
// { attributes:
//     { type: 'Contact',
//       url: '/services/data/v42.0/sobjects/Contact/0033h000001sDzDAAU' },
//    Id: '0033h000001sDzDAAU',
//    FirstName: 'Dale',
//    LastName: 'Cooper',
//    Email_Verified__c: true,
//    Enrollment_Status__c: 'Pending'
//  }
cooper.Enrollment_Status__c = 'Accepted';
let ret = await conn.sobject('Contact').update(cooper);
if (ret.success) {
    console.log('Contact updated in Salesforce.');
}
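As a couple of quick asides (the object and field names here are only illustrative): going up from child to parent in SOQL uses dot notation, and creating or deleting records follows the same sobject pattern as the update above.

// Child-to-parent query: walk up the relationship with dot notation.
let soqlUp = 'SELECT Id, LastName, Account.Name, Account.Owner.Name FROM Contact';
let upContacts = await conn.query(soqlUp);

// Create and delete use the same sobject API as update.
let created = await conn.sobject('Contact').create({ FirstName: 'Audrey', LastName: 'Horne' });
await conn.sobject('Contact').destroy(created.id);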

Working with Large Amounts of Data

You may need to read and write large amounts of data, for example if you are using Salesforce for reporting and loading data to and from other systems.

Event-driven Querying

The record limit for standard promise-style SOQL querying, as in our example above, is 2000 records. To query more than that, it is best to shift to the event-driven style of querying. This will ensure that all records are successfully retrieved from Salesforce. You can use the maxFetch property to set the upper limit of records returned. By default, maxFetch is set to 10,000.

let contacts = [];
let soql = 'SELECT Id, FirstName, LastName, Email_Verified__c, Enrollment_Status__c from Contact';
let query = await conn.query(soql)
.on("record", (record) => {
    contacts.push(record);
})
.on("end", async () => {
    console.log(`Fetched Contacts. Total records fetched: ${contacts.length}`);
})
.on("error", (err) => {
  console.error(err);
})
.run({ autoFetch : true, maxFetch : 5000 });

Loading Data with the Bulk API

Loading a large amount of data into Salesforce is best accomplished through the Bulk API via JSforce. There are a couple of good reasons for this approach.

The Bulk API has better performance over other methods when working with large collections of objects.

The standard JSforce sobject create/update/delete functions have a 200-object limit. For operations on large collections, you must split the records into batches of 200, resulting in many separate API calls. By contrast, the Bulk API uses only a single API call. Since Salesforce imposes API usage limits, this makes the Bulk API a better choice.

Running a bulk operation is simple using the bulk.load method, which takes three parameters: the Salesforce object type, the operation type, and an array of objects. The method returns an array of objects with success/errors fields, as well as the id of the newly created object, if successful.

If you’re working with thousands of objects, it’s good to set the pollTimeout property manually to one minute or more, to avoid Salesforce connection timeouts. Also note that the possible values for operation type are: ‘insert’, ‘update’, ‘upsert’, ‘delete’, or ‘hardDelete’.

//set poll timeout to four minutes for larger datasets
conn.bulk.pollTimeout = 240000;
//normally you will have thousands of Accounts, this is just an example
let accounts = [
    { Name: 'Saul Goodman, LLC' },
    { Name: 'Los Pollos Hermanos Inc' },
    { Name: 'Hamlin, Hamlin & McGill' }
];
let results = await conn.bulk.load('Account','insert', accounts);
console.log(results);
// Console output:
// [ { id: '0013h000006bdd2AAA', success: true, errors: [] },
// { id: '0013h000006bdd3AAA', success: true, errors: [] },
// { id: '0013h000006bdd4AAA', success: true, errors: [] } ]
if (accounts.length === results.filter(x => x.success).length){
    console.log('All accounts successfully loaded.');
}

WebSocket Streaming with Socket.io

Say you are building a web application for reporting, and the app contains a dashboard with data on all of your contacts in Salesforce. You want the dashboard to be updated whenever the data in Salesforce changes, and you also want this to happen without refreshing the web page.

To accomplish this, you can stream real-time data from Salesforce using JSforce and the Socket.IO library, which makes working with WebSockets quite simple.

The first step in this process is creating a PushTopic in Salesforce. This is basically a trigger that emits a notification anytime an object is created, updated, etc. in Salesforce. I created a PushTopic for Contacts by running the following Apex code in the Salesforce developer console.

PushTopic pushTopic = new PushTopic();
pushTopic.Name = 'UserChange';
pushTopic.Query = 'SELECT Id, FirstName, LastName, Email_Verified__c, Enrollment_Status__c FROM Contact';
pushTopic.ApiVersion = 48.0;
pushTopic.NotifyForOperationCreate = true;
pushTopic.NotifyForOperationUpdate = true;
pushTopic.NotifyForOperationUndelete = true;
pushTopic.NotifyForOperationDelete = true;
pushTopic.NotifyForFields = 'Referenced';
insert pushTopic;

Back in your Node app, you’ll want to create a very basic Express server that listens for updates from the Salesforce PushTopic and emits them to your reporting site. Start by installing Express and Socket.IO.

npm install express
npm install socket.io

Then delete the run function in your script.js file, which contained the code from the samples above, and replace it with the following:

//Express and Socket.IO setup (needed by the code below; server and io were not shown in the earlier samples)
var express = require('express');
var app = express();
var server = require('http').createServer(app);
var io = require('socket.io')(server);

async function run(){
  //listen with express
  server.listen(3000, function(){
      console.log('listening on *:3000');
  });

  //connect to Salesforce
  let creds = JSON.parse(fs.readFileSync(path.resolve(__dirname,'./salesforce-creds.json')).toString());
  let conn = new jsforce.Connection({ loginUrl : creds.url });
  try {
      await conn.login(creds.username, creds.password);
  } catch (err) {
      console.error(err);
  }

  //when the client connects, emit streaming updates from salesforce to client
  io.on("connection", (socket) => {
     console.log('A socket connection was made!');
     let eventHandler = (message) => {
          console.log('New streaming event received from Salesforce:', message);
          socket.emit('UserChange', message);
      };
     conn.streaming.topic('UserChange').subscribe(eventHandler);
  });
}

Here is a step-by-step description of what is occurring in the code sample above:

  • The Express server is set to listen for connections on port 3000.
  • We connect to Salesforce and login.
  • Socket.IO is set to listen for incoming connections from clients.
  • A function called eventHandler that emits Salesforce streaming messages to the client is defined.
  • When a connection is made, eventHandler is attached to the Salesforce streaming topic as a callback, using the live Salesforce connection.

If you follow the nice little tutorial from Socket.IO and create the sample chat webpage, you can actually test the Salesforce streaming updates. In the chat page, add this script, which will log messages coming back from Salesforce.

<script>
   var socket = io();
   socket.on('UserChange', function(msg){
     console.log(msg);
   });
</script>

Then update a contact in Salesforce, changing the contact’s first name. If everything works correctly, you should see the client connect via Socket.IO in the Node logs, and also see a streaming message from Salesforce logged in the browser’s console window.

Summary

Node.js and JSforce provide a straightforward and elegant way to interact with Salesforce. Whether you have an existing Node API that needs to work with Salesforce, or you are building a new application that is powered by Salesforce data, consider the recipes above as stepping stones for completing your project.

Magento 2: Creating a custom module

Bridge with wires

Photo by Babatunde Olajide, cropped from original

A Magento module is a set of classes and routines that will depend on and interact with other Magento classes in order to add a specific feature to a Magento application. While a theme is orientated towards the front-end and user experience, a module is orientated towards backend logic and application flow.

We will need to create a custom module if we want to add or change the existing logic at a level where Magento doesn’t provide a setting or option for it. For example, if our business has a specific feature or set of features or requirements that are not common to the market, a module can fill that gap for us.

Creating a basic Magento 2 module

Creating a simple module in Magento 2 is not that hard. We will need to accomplish the following tasks:

  • Create a new directory for the module
  • Create a registration.php script
  • Create a etc/module.xml information file
  • Install the new module

Creating a new directory for the module

Where should the new directory for our module be placed? We have two options to choose from:

  • app/code/{vendor}/
  • vendor/{vendor}/

If your module is intended for a specific website you’re working on, you can use the first option. If you’re creating a module with the intention of it being used on several websites, it’s best to choose the second option. We’ll use the first for this example.

Let’s create a directory named EndPoint (our vendor name) with a subdirectory inside it, MyModule:

cd {website_root}
mkdir app/code/EndPoint
mkdir app/code/EndPoint/MyModule

Creating the registration.php script

The registration.php file tells Magento to register the new module under a specific name and location. Let’s create a file named app/code/EndPoint/MyModule/registration.php with the following content:

<?php
\Magento\Framework\Component\ComponentRegistrar::register(
    \Magento\Framework\Component\ComponentRegistrar::MODULE,
    'EndPoint_MyModule',
    __DIR__
);

We’re telling Magento that our module will be named EndPoint_MyModule.

Creating the etc/module.xml information file

Now, let’s create our module information file, where we’ll specify the module version number. First, we need to create the etc directory inside app/code/EndPoint/MyModule,

mkdir app/code/EndPoint/MyModule/etc

then create module.xml with the following content:

<?xml version="1.0"?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:Module/etc/module.xsd">
    <module name="EndPoint_MyModule" setup_version="1.0.0">
    </module>
</config>

Installing the new module

That’s it! We have everything we need to install our new module. Now we need to tell Magento we want to install and enable our new module. So from our website root we need to run:

php bin/magento setup:upgrade

Magento will output a list of module names and configuration updates, and our new module EndPoint_MyModule should be listed in that output.

Adding a custom route to our module

Now we have a working, enabled module, but it’s not doing anything yet! What’s a simple way to check that our module is enabled? Let’s set up a custom route, so if we hit a URL like https://{our_website}/mymodule/test/helloworld we can return a custom response from a controller.

Creating a custom route will need some steps on its own:

  • Create a new directory for the controller
  • Create a etc/routes.xml file
  • Create the controller
  • Upgrade the new module

Creating a new directory for the controller

First, we need to create a new directory where the new PHP controller for our custom route will live. The new directory path should be:

  • app/code/EndPoint/MyModule/Controller

We can create as many directory levels as we want, depending on our desired path. For example, if we create a class named Index in app/code/EndPoint/MyModule/Controller, the URL that will be routed to this controller will be https://{our_website}/mymodule/index (the “Controller” directory is ignored). If we create a class named HelloWorld in app/code/EndPoint/MyModule/Controller/Test, the resulting URL will be https://{our_website}/mymodule/test/helloworld.

Creating the etc/routes.xml file

routes.xml will tell Magento what base URL will be used for our module. First, we need to create the “frontend” directory where the routes.xml file needs to be placed:

mkdir app/code/EndPoint/MyModule/etc/frontend

In this example, we want the base URL to be MyModule, so we need to create an XML file inside the new directory that will route all requests made to the given URL to our module controllers:

<?xml version="1.0" ?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:App/etc/routes.xsd">
    <router id="standard">
        <route frontName="mymodule" id="mymodule">
            <module name="EndPoint_MyModule"/>
        </route>
    </router>
</config>

Creating the controller

If we want to respond to requests for https://{our_website}/mymodule/test/helloworld we first need to create the base Controller directory and the Test subdirectory:

mkdir app/code/EndPoint/MyModule/Controller
mkdir app/code/EndPoint/MyModule/Controller/Test

Under this directory, we’ll create our custom Magento controller. All route controllers should extend \Magento\Framework\App\Action\Action. We also need a public __construct() method to pass the context to the parent class, and an execute() method that will be called when the URL is hit:

<?php

namespace EndPoint\MyModule\Controller\Test;

class HelloWorld extends \Magento\Framework\App\Action\Action
{

    public function __construct(
        \Magento\Framework\App\Action\Context $context
    ) {
        parent::__construct(
            $context
        );
    }

    public function execute()
    {
        echo "Hello world!";
    }

}

Upgrading the new module

We have everything in place to tell Magento we have new changes to be deployed. How do we do that? First, we need to upgrade our Magento setup. Since we added a new controller that gets parameters from the dependency injector in its constructor, we also need to compile the dependency injection engine (including factories, proxies, and interceptors). Finally, we need to clear the cache so new content will be served from our custom URL:

php bin/magento setup:upgrade
php bin/magento setup:di:compile
php bin/magento cache:flush

This process can take a few minutes to complete, but after it’s done we can try to reach our new custom URL. If we get a response like the one below:

Hello world!

That means our module is working!

That’s all for now. In upcoming posts, we’ll start complicating things a bit by overriding Magento classes with our custom ones and creating custom controllers that will return information from the Magento core classes. We will also explore how to customize the front-end by creating a theme. Don’t forget to add any questions, suggestions or issues in the comments below!

Installing Ubuntu 18.04 to a different partition from an existing Ubuntu installation

Clean setup

Photo by Patryk Grądys on Unsplash

Our Liquid Galaxy systems are running on Ubuntu 14.04 LTS (Trusty). We decided to upgrade them to Ubuntu 18.04 LTS (Bionic) since Ubuntu 14.04 LTS reached its end of life on April 30, 2019.

Upgrading from Ubuntu 14.04 LTS

The recommended way to upgrade from Ubuntu 14.04 LTS is to first upgrade to 16.04 LTS, then to 18.04 LTS, which will continue to receive support until April 2023. Ubuntu has LTS -> LTS upgrades, allowing you to skip intermediate non-LTS releases, but we can’t skip intermediate LTS releases; we have to go via 16.04, unless we want to do a fresh install of 18.04 LTS.

14.04 LTS -> 16.04 LTS -> 18.04 LTS

For a little more longevity, we decided to do a fresh install of Ubuntu 18.04 LTS. Not only is this release supported into 2023 but it will offer a direct upgrade route to Ubuntu 20.04 LTS when it’s released in April 2020.

Installing Clean Ubuntu 18.04 LTS from Ubuntu 14.04 LTS

Install debootstrap

The debootstrap utility installs a very minimal Debian system. Debootstrap will install a Debian-based OS into a sub-directory. You don’t need an installation CD for this. However, you need to have access to the corresponding Linux distribution repository (e.g. Debian or Ubuntu).

/usr/bin/apt-get update
/usr/bin/apt-get -y install debootstrap

Creating a new root partition

Create a logical volume with size 12G and format the filesystem to ext4:

/sbin/lvcreate -L12G -n ROOT_VG/ROOT_VOLUME
/sbin/mkfs.ext4 /dev/ROOT_VG/ROOT_VOLUME

Mounting the new root partition

Mount the partition at /mnt/root18. This will be the root (/) of your new system.

/bin/mkdir -p "/mnt/root18"
/bin/mount /dev/ROOT_VG/ROOT_VOLUME /mnt/root18

Bootstrapping the new root partition

Debootstrap can download the necessary files directly from the repository. You can substitute any Ubuntu archive mirror for ports.ubuntu.com/ubuntu-ports in the command example below. Mirrors are listed here.

Replace $ARCH below with your architecture: amd64, arm64, armhf, i386, powerpc, ppc64el, or s390x.

/usr/sbin/debootstrap --arch "$ARCH" "$DISTRO" "$ROOT_MOUNTPOINT"
/usr/sbin/debootstrap --arch "amd64" "bionic" "/mnt/root18"

Installing fstab

This just changes the root (/) partition path in the new installation while keeping the /boot partition intact. For example, /dev/mapper/headVG-root / -> /dev/mapper/headVG-root18 /. Since device names are not guaranteed to be the same after rebooting or when a new device is connected, we use UUIDs (Universally Unique Identifiers) to refer to partitions in fstab. We don’t need to use UUIDs for logical volumes since they can’t be duplicated.

OLD_ROOT_PATH="$(awk '$2 == "/" { print $1 }' /etc/fstab)"
/bin/sed "s:^${OLD_ROOT_PATH}\s:/dev/mapper/headVG-root18 :" /etc/fstab > "/mnt/root18/etc/fstab"

Mounting things in the new root partition

Bind /dev to the new location, then mount /sys, /proc, and /dev/pts from your host system to the target system.

/bin/mount --bind /dev "/mnt/root18/dev"
/bin/mount -t sysfs none "/mnt/root18/sys"
/bin/mount -t proc none "/mnt/root18/proc"
/bin/mount -t devpts none "/mnt/root18/dev/pts"

Configuring apt

Debootstrap will have created a very basic /mnt/root18/etc/apt/sources.list that will allow installing additional packages. However, I suggest that you add some additional sources, such as the following, for source packages and security updates:

/bin/echo "deb http://us.archive.ubuntu.com/ubuntu bionic main universe
deb-src http://us.archive.ubuntu.com/ubuntu bionic main universe
deb http://security.ubuntu.com/ubuntu bionic-security main universe
deb-src http://security.ubuntu.com/ubuntu bionic-security main universe" > /mnt/root18/etc/apt/sources.list

Make sure to run apt update with chroot after you have made changes to the target system sources list.
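
For example:

chroot /mnt/root18 /usr/bin/apt-get update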

Now we’ve got a real Ubuntu system, if a rather small one, on disk. chroot into it to set up the base configurations.

LANG=C.UTF-8 chroot /mnt/root18 /bin/bash

Installing required packages and running chef-client

As we are maintaining most of the Liquid Galaxy configuration and packages with Chef, we need to install chef-client, configure it on the new target system, and run chef-client to complete the setup.

Copy the chef configuration and persistent net udev rules into place:

cp -a /etc/chef "/mnt/root18/etc/"
cp /etc/udev/rules.d/70-persistent-net.rules /mnt/root18/etc/udev/rules.d/

Install and run chef-client and let it set up our user login:

/bin/cat << EOF | chroot "/mnt/root18"
/usr/bin/apt-get update && /usr/bin/apt-get install -y curl wget
/usr/bin/curl -L https://omnitruck.chef.io/install.sh | /bin/bash -s -- -v 12.5.1
/usr/bin/chef-client -E production_trusty -o 'recipe[users]'
EOF

Next, chroot and install the required packages:

cat << EOF | chroot "/mnt/root18"
/bin/mount /boot
/usr/bin/apt-get update && /usr/bin/apt-get install -y --no-install-recommends linux-image-generic lvm2 openssh-server ifupdown net-tools
/usr/sbin/locale-gen en_US.UTF-8
EOF

Set Ubuntu 14.04 to boot default

Back up the current trusty kernel files into /boot/trusty and create a custom menu entry configuration for Ubuntu 14.04 on 42_custom_trusty. Update /etc/default/grub to set Ubuntu 14.04 as the default menu entry and run update-grub to apply it to the current system. This will be used as a fail-safe method to run Trusty again if there is a problem with the new installation.

mkdir -vp /boot/trusty
cp -v /boot/*-generic /boot/trusty/
sed -i 's/GRUB_DEFAULT=.*/GRUB_DEFAULT="TRUSTY"/' /etc/default/grub
update-grub

Create the custom menu entry for Ubuntu 14.04 and Ubuntu 18.04 on the target system.

mkdir -p /mnt/root18/etc/grub.d
cat 42_custom_template > /mnt/root18/etc/grub.d/42_custom_menu_entry

chroot into the target system and run update-grub. This will also update the GRUB configuration to boot Ubuntu 14.04 by default and set the 0th menu entry to Ubuntu 18.04 (Bionic).

cat << EOF | chroot "/mnt/root18"
update-grub
EOF

Boot into Bionic

To boot into Ubuntu 18.04 (Bionic), reboot the system after grub-reboot bionic and test if the bionic system is working as expected.

$ grub-reboot bionic
$ reboot

Reboot and test our new 0th GRUB entry:

$ grub-reboot 0
$ reboot

A normal reboot returns to Ubuntu 14.04 (Trusty) since the default menu entry is still set to Ubuntu 14.04 (Trusty).

Set Ubuntu 18.04 to boot default

To set our new Ubuntu 18.04 installation as the default menu entry, change GRUB_DEFAULT to 0 in /etc/default/grub and run update-grub to apply it. The next reboot will boot into Ubuntu 18.04.

sed -i 's/GRUB_DEFAULT=.*/GRUB_DEFAULT=0/' /etc/default/grub
update-grub

Congratulations! You now have a freshly installed Ubuntu 18.04 system.

Creating a Messaging App Using Spring for Apache Kafka, Part 1

spring-kafka

Photo by Click and Learn Photography at Unsplash

Spring is a popular Java application framework. Apache Kafka is a fault-tolerant, fast, and horizontally scalable distributed stream-message broker. Spring for Apache Kafka applies the overall concepts of Spring to Java applications based on Kafka.

Since Kafka can establish a fast and fault-tolerant stream data pipeline it can be used as an orchestrator. In this article I’ll explain how to create a spring-kafka project, add dependencies and use Kafka to create a messaging app.

Initialize Spring project

Spring projects can be built from scratch using Spring Initializr. I like to keep the default options. Most Spring projects use Maven. I set the group id as com.endpoint and the artifact as SpringKafkaMessaging which makes the base package name com.endpoint.SpringKafkaMessaging.

Spring Initializr

When we are done with the initial project setup we press the “GENERATE” button to download an empty Spring Boot project in a zip file. You can then use your favorite IDE to open and start developing your project. I prefer Eclipse for Java projects. Here’s what it looks like when I open the project up:

Eclipse

I won’t address detailed configuration or adding dependencies of Spring and Maven projects in this post. If you are not familiar with Spring and Maven, I recommend that you have a look at the Spring documentation first.

Design and architecture

Before adding the dependencies, including Kafka, we need to make a high-level design of this simple project and figure out how to proceed with development. Messaging apps look simple on the surface, but the architecture behind them can be quite complex. There are many technology stacks to choose from. Which base protocol we choose (XMPP, SIP, or WebSocket) depends on the app’s aim. Sometimes multiple protocols are used and interconnected to provide more features; XMPP is mostly used for chat, while SIP is designed for VoIP and media transfer. We’ll use WebSocket to communicate with Kafka over TCP.

Understanding the architectural model of Kafka will give you a sense of how Kafka is going to handle most of the backend processes.

Kafka, as I mentioned previously, is horizontally scalable, meaning that Kafka clusters can be grown to span several data sources. Basically, message producers and message consumers (all client messaging apps are both producers and consumers) produce and consume messages through Kafka topics.
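
As a small preview of how this looks with spring-kafka (the class, topic, and group names below are illustrative only, a String key/value serializer configuration is assumed, and the real implementation comes later in this series):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class MessagingPreview {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    // Produce: publish a chat message to the example "messages" topic.
    public void send(String message) {
        kafkaTemplate.send("messages", message);
    }

    // Consume: spring-kafka calls this for every record arriving on the topic.
    @KafkaListener(topics = "messages", groupId = "messaging-app")
    public void listen(String message) {
        System.out.println("Received: " + message);
    }
}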

So, taking into account the principles for designing the architecture of such a client–server-based messaging app, here are the components and their communication directions:

  • Kafka Cluster
  • Spring Boot REST API, which will handle user authentication and login
  • Persistence (here I chose PostgreSQL)
  • Cache (Redis) for fast read-write cache operations
  • WebSocket for messaging app clients

spring-kafka dependencies

After creating a model and components, let’s add our dependencies to the pom.xml file to finish creating our project. Below we add spring-boot-starter, spring-boot-starter-web, spring-kafka, spring-boot-starter-jdbc, and redis.clients:jedis for the corresponding REST, Kafka, persistence (JDBC), and Redis components.

<dependencies>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
  </dependency>

  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
  </dependency>

  <dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
  </dependency>

  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-jdbc</artifactId>
  </dependency>

  <dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
  </dependency>

  <dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
  </dependency>

  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
  </dependency>

  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
    <exclusions>
      <exclusion>
        <groupId>org.junit.vintage</groupId>
        <artifactId>junit-vintage-engine</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <dependency>
      <groupId>org.springframework.kafka</groupId>
      <artifactId>spring-kafka-test</artifactId>
      <scope>test</scope>
  </dependency>
</dependencies>

To be continued in Part 2!


Migrating large PostgreSQL databases

Migration

Photo by Harshil Gudka on Unsplash

The challenge

One of our clients has a large and important health-related application. It’s built on an end-of-life Ruby on Rails-based open source framework, heavily customized over the years. They wanted to upgrade to a newer, supported, Java-based open source application a partner organization had developed as a replacement. Both organizations used the old system previously. To do that we would need to migrate all their existing PostgreSQL data from the old system to the new one, retaining important customizations while adapting to the new database schema.

Although there were many similarities between the old system and the new, the differences were significant enough to require careful study of the database schemas and the migration scripts designed to move the data:

  • There were schema-level differences between our old database and the partner organization’s old database.
  • Even where the two old databases were similar there were differences on the data level, such as different standards, different values in table records, different representation, etc.
  • We had different content, so if a script was working well for their data, it was not necessarily correct for us.
  • There were dynamically generated tables in both old databases, and we had to work out how to convert our existing schema elements and their records into the planned schema elements and records.

We had to understand the differences between our old database and theirs. Due to the number of tables and the average number of columns, manual comparison between databases was not really an option. We knew that the algorithm for handling the scripts would look like this:

For each S in Scripts
    Analyze S and understand the intent behind it
    Compute a read-only version of S to avoid write operations
    Execute the read-only version of S
    Analyze the results and find out whether they are different from the expected results
    Convert our read-only version of S to S′, where S′ is compatible with our expectations
    While there are technical issues do
        Fix it
    While end
    Execute S′
For end

Understanding these differences was easier said than done. The main problem was that we found it difficult to define our exact expectations. For that we needed a deeper understanding of the databases.

Entity Relationship Diagram

We first needed to see an Entity Relationship Diagram (ER diagram or ERD). We used DbVisualizer for this.

We imported the database into our local RDBMS and then created a database connection by right-clicking Connections in the left menu tree

Dbvisualizer connections

then clicking on Create Database Connection and selecting No Wizard

Dbvisualizer connection wizard

and filling in the data.

Dbvisualizer connection form

After that, we double-clicked on the schema we wanted to generate an ER diagram for and clicked on Open Object.

Dbvisualizer open object

Finally, we clicked on the References tab and a bird’s-eye view of the ER diagram was generated.

Dbvizualizer references

Dbvisualizer schema birdview

Then we right-clicked on the diagram and clicked on Export.

Dbvisualizer schema export

We chose SVG and saved it. After opening the SVG diagram, we saw that the schema was too big to easily analyze, so we dropped the tables we were not specifically interested in from our local copy and generated a new ER diagram. It was super easy and cool. Finally, we were able to see which parts of the other team’s database were old and which were new. We were also able to compare our old database with the new database we were implementing.

Comparing our databases against their counterparts

Next we needed to understand the schema differences between their old database and our old database, to determine which selections in the scripts would not work properly and how we needed to modify them to fit our technical nuances.

We used Liquibase for this purpose. See Selva’s article on comparing PostgreSQL database schema versions.

The actual command we used was diff.
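
The invocation follows the pattern documented by Liquibase; the connection details below are placeholders rather than our real ones:

liquibase --driver=org.postgresql.Driver \
    --url="jdbc:postgresql://localhost:5432/our_old_db" \
    --username=dbuser \
    --password=dbpass \
    diff \
    --referenceUrl="jdbc:postgresql://localhost:5432/their_old_db" \
    --referenceUsername=dbuser \
    --referencePassword=dbpass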

So, we needed to make sure we had a proper setup and then we could run the command. The example output the documentation gives is this:

Diff Results:
Reference Database: MYSCHEMA2 @ jdbc:oracle:thin:@localhost:1521:ORCL (Default Schema: MYSCHEMA2)
Comparison Database: MYSCHEMA @ jdbc:oracle:thin:@localhost:1521:ORCL (Default Schema: MYSCHEMA)
Compared Schemas: MYSCHEMA2 -> MYSCHEMA
Product Name: EQUAL
Product Version: EQUAL
Missing Catalog(s): NONE
Unexpected Catalog(s): NONE
Changed Catalog(s): NONE
Missing Check Constraint(s): NONE
Unexpected Check Constraint(s): NONE
Changed Check Constraint(s): NONE
Missing Column(s): NONE
Unexpected Column(s):
     MYSCHEMA.DEPARTMENT.ACTIVE
     MYSCHEMA.SERVICETECH.ACTIVE
     MYSCHEMA.SERVICETECH2.ACTIVE
     MYSCHEMA.SERVICETECH3.ACTIVE
     MYSCHEMA.VIEW1.ACTIVE
     MYSCHEMA.DATABASECHANGELOG.AUTHOR
     MYSCHEMA.DATABASECHANGELOG.COMMENTS
     MYSCHEMA.DATABASECHANGELOG.CONTEXTS
     MYSCHEMA.DATABASECHANGELOG.DATEEXECUTED
     MYSCHEMA.DATABASECHANGELOG.DEPLOYMENT_ID
     MYSCHEMA.DATABASECHANGELOG.DESCRIPTION
     MYSCHEMA.DATABASECHANGELOG.EXECTYPE
     MYSCHEMA.DATABASECHANGELOG.FILENAME
     MYSCHEMA.DATABASECHANGELOG.ID
     MYSCHEMA.DATABASECHANGELOGLOCK.ID
     MYSCHEMA.DEPARTMENT.ID
     MYSCHEMA.SERVICETECH.ID
     MYSCHEMA.SERVICETECH2.ID
     MYSCHEMA.SERVICETECH3.ID
     MYSCHEMA.VIEW1.ID
     MYSCHEMA.DATABASECHANGELOG.LABELS
     MYSCHEMA.DATABASECHANGELOG.LIQUIBASE
     MYSCHEMA.DATABASECHANGELOGLOCK.LOCKED
     MYSCHEMA.DATABASECHANGELOGLOCK.LOCKEDBY
     MYSCHEMA.DATABASECHANGELOGLOCK.LOCKGRANTED
     MYSCHEMA.DATABASECHANGELOG.MD5SUM
     MYSCHEMA.DEPARTMENT.NAME
     MYSCHEMA.SERVICETECH.NAME
     MYSCHEMA.SERVICETECH2.NAME
     MYSCHEMA.SERVICETECH3.NAME
     MYSCHEMA.VIEW1.NAME
     MYSCHEMA.DATABASECHANGELOG.ORDEREXECUTED
     MYSCHEMA.DATABASECHANGELOG.TAG
Changed Column(s): NONE
Missing Database Package(s): NONE
Unexpected Database Package(s): NONE
Changed Database Package(s): NONE
Missing Database Package Body(s): NONE
Unexpected Database Package Body(s): NONE
Changed Database Package Body(s): NONE
Missing Foreign Key(s): NONE
Unexpected Foreign Key(s): NONE
Changed Foreign Key(s): NONE
Missing Function(s): NONE
Unexpected Function(s): NONE
Changed Function(s): NONE
Missing Index(s): NONE
Unexpected Index(s):
     PK_DATABASECHANGELOGLOCK UNIQUE  ON MYSCHEMA.DATABASECHANGELOGLOCK(ID)
     PK_DEPARTMENT UNIQUE  ON MYSCHEMA.DEPARTMENT(ID)
     PK_SERVICETECH UNIQUE  ON MYSCHEMA.SERVICETECH(ID)
     PK_SERVICETECH2 UNIQUE  ON MYSCHEMA.SERVICETECH2(ID)
     PK_SERVICETECH3 UNIQUE  ON MYSCHEMA.SERVICETECH3(ID)
Changed Index(s): NONE
Missing Java Class(s): NONE
Unexpected Java Class(s): NONE
Changed Java Class(s): NONE
Missing Java Source(s): NONE
Unexpected Java Source(s): NONE
Changed Java Source(s): NONE
Missing Primary Key(s): NONE
Unexpected Primary Key(s):
     PK_DATABASECHANGELOGLOCK on MYSCHEMA.DATABASECHANGELOGLOCK(ID)
     PK_DEPARTMENT on MYSCHEMA.DEPARTMENT(ID)
     PK_SERVICETECH on MYSCHEMA.SERVICETECH(ID)
     PK_SERVICETECH2 on MYSCHEMA.SERVICETECH2(ID)
     PK_SERVICETECH3 on MYSCHEMA.SERVICETECH3(ID)
Changed Primary Key(s): NONE
Missing Sequence(s): NONE
Unexpected Sequence(s): NONE
Changed Sequence(s): NONE
Missing Stored Procedure(s): NONE
Unexpected Stored Procedure(s): NONE
Changed Stored Procedure(s): NONE
Missing Synonym(s): NONE
Unexpected Synonym(s): NONE
Changed Synonym(s): NONE
Missing Table(s): NONE
Unexpected Table(s):
     DATABASECHANGELOG
     DATABASECHANGELOGLOCK
     DEPARTMENT
     SERVICETECH
     SERVICETECH2
     SERVICETECH3
Changed Table(s): NONE
Missing Trigger(s): NONE
Unexpected Trigger(s): NONE
Changed Trigger(s): NONE
Missing Unique Constraint(s): NONE
Unexpected Unique Constraint(s): NONE
Changed Unique Constraint(s): NONE
Missing View(s): NONE
Unexpected View(s):
     VIEW1
Changed View(s): NONE
Liquibase command 'diff' was executed successfully.

Of course, we could do this job manually by listing all the tables with psql’s \dt and then checking each of them individually with \d tablename, but if there are many tables, this would take forever.

Yes, we can write software for this purpose, implementing an algorithm along the lines of

tables = <execute \dt>
For each (tables as table) do
    Differences[table] = difference(<execute \d table at db1>, <execute \d table at db2>)
End For

however, the algorithm above won’t handle special cases, like tables existing in db1 but not in db2 or vice versa. It also quietly outsources the hard part to a function called difference: parsing the output of both commands, recognizing whether each row describes a column name, an index, a foreign key, etc., identifying the subject of each line (e.g. the column name), and matching the corresponding entries between the two databases.

It is of course implementable, but it would add a considerable amount of work. We should also mention that such a newly developed piece of code would not be well tested yet and we would have to watch out for possible bugs, create unit tests, create a nice UI or file export to ensure that we can analyse the results, and so on. All this work is unnecessary due to the availability of Liquibase and we are only talking about a single command compared to the many here.

Dynamically generated tables

In practical terms this means the data our software must manage does not fit a pre-established schema; users create and update new data collection forms regularly. These forms consist of sets of uniquely named questions and text-based answers to those questions. The PostgreSQL JSON data type may seem like a natural fit for such data. However, the original version of the software predates PostgreSQL’s now extensive JSON support. The software version from which we were upgrading stored these data in an Entity-Attribute-Value schema, a database pattern often maligned (justly) by database designers.

In this version, a single table stored all the answers given for every user-defined question for every case in the system, along with a pointer to the associated question and case. As one might expect, this table grew fairly large, though its principal drawbacks were not its size but rather the large number of joins necessary to process the data it contained, and the lack of sufficient data validation. The hstore data type might have been a better fit; however, neither programming-language support for hstore data nor developer familiarity with it made it an obvious choice at the time. We did use hstore widely in the backend for data manipulation functions that could be contained entirely in SQL.

Fast forward to newer versions, where this schema has been redesigned. We weren’t involved in the design process and can’t comment on the justification behind this design decision, but the new version creates new tables within the database as needed for each data entry form, and text fields for each question on the form. This reduces the number of joins or aggregations necessary to compile all the data for one form for a single case, but it means creating SQL queries dynamically to create, and later to find, the tables and columns containing data of interest.

We’ve run our fingers through the data several times, both during and after the migration, and found neither schema variant satisfies our every wish. Both versions store users’ data as text fields, whatever data type they may represent. Some form of data validation at the database level would be very nice, and in the new version where each field has its own column in the database, this is entirely possible, though of course, it would have required more work in the development process. In particular, many questions expect answers taken from a predefined set, for which enumerated types could be a good fit. Of course, stored procedures could conceivably ensure valid data no matter its data type in the schema, but this doesn’t seem like a plausible option in practice. As a further drawback to the new approach, column and table names derive from user-defined data, meaning we need to sanitize user input to create valid PostgreSQL identifiers. This is a tricky process, and difficult to separate entirely into its own module to avoid reimplementing the same intricate logic multiple times.

JSON data types provide one possible schema alternative, with all entries for one data entry form for a single case stored in a single JSON field, and indeed the PostgreSQL documentation proposes its use in such situations. It’s not entirely clear, though, that this would be a win. We could define new keys within the JSON structure without needing to modify the database schema itself, and with JSON we’d always know exactly what table and field we needed, to find the data we were after, but we’d still need to write queries dynamically in order to pull the desired fields. We could avoid some of the data sanitization necessary to create field names, as the rules for JSON key names are far more permissive than for column names in a proper database table. But, again barring extensive stored procedures, we would still have very limited ability to validate data within the database itself, as JSON supports only a small set of primitive types.

Putting it all together

After we acquired the understanding that we needed we were able to work out the migration script according to the algorithm that we outlined at the start of this article.

This was still a long, labor-intensive task, done over repeated pair-programming sessions, but we were able to reach a high level of accuracy: so high that, to our great surprise, the application started right after the migration process was done.

Release

We were able to do the release on a weekend and the three of us moved on to solving problems submitted by beta testers. We called this process “dragon hunting”.

Dragon


(Written with help from Selvakumar Arumugam and Joshua Tolley.)

Convenient Reporting with Jasper

Basalt pillars

Business Intelligence (BI) reporting is a huge problem space in custom software. There’s a wide range of business needs for looking at past and predictive behavior. Building a reporting tool can be a very cost effective way to get this data, especially compared to writing individual queries or manually generating reports.

I’ve been working with Jasper in the Java project space and wanted to write about some research I’ve collected on the topic.

JasperReports takes .jrxml files as input and compiles them into .jasper files, from which it fills and exports reports. Possible output targets include:

  • Screen
  • Printer
  • PDF
  • HTML
  • Excel files
  • RTF
  • ODT
  • CSV
  • XML

Jasper history

  • June 2001: Teodor Danciu began working on JasperReports.
  • September 2001: Jasper was registered on SourceForge.
  • November 2001: JasperReports 0.1.5 was released.
  • 2004: Panscopic teamed up with Teodor Danciu, acquired ownership of the product and changed its name to Jaspersoft.
  • 2005: JasperReports 1.0 was released.
  • 2007: Brian Gentile became CEO of the company.
  • 2014: TIBCO acquired Jaspersoft for ~$185 million.

Best reporting tools

Let’s compare some popular reporting tools:

  • JasperReports is a free and open source Java-based reporting tool which supports lots of possible outputs, as mentioned earlier. Generating reports can be difficult if you’re less technical, and some of the more technical aspects take effort too; embedding JasperReports into a project is not necessarily simple, but once it’s done, the tool is reliable.
  • Crystal Reports supports many inputs, including Access, Excel, XML, ODBC, and JDBC. It also has good multi-language support. It’s easy to embed into a .NET project, but software updates are unstable. The process can be very slow and there is no control of data-level security. A trial version is offered, but if one wants to use it long-term, then the price is a one-time payment of $495 or more, for larger companies.
  • Domo is another popular reporting tool. It provides a trial version, and a 5-user plan costs $5700/year*.
  • Zoho Analytics is an easy-to-use BI reporting tool, priced between $22 and $445, depending on the number of users and data.
  • Host Analytics is a great tool for finance automation. Its pricing is not publicized.
  • Tableau is an excellent reporting tool, with a thriving community online, but its quote-based price is high.
  • Pentaho is a Java-based reporting tool, which provides data integration, online analytical processing and reporting, among other features. Pentaho offers a 30-day trial period. Contract pricing isn’t disclosed.

So, if you are writing software and already use Java, or using Java reporting is an option, JasperReports is a great choice. It supports a variety of outputs, is free to use, and open source.

Installing JasperReports Server

To install JasperReports Server, you need a computer with a fully functional JRE (Java Runtime Environment). The application server can be Tomcat or GlassFish. An RDBMS is also needed, since JasperReports has its own database; this can be PostgreSQL, Oracle, MySQL, DB2, or SQL Server. JasperReports prefers PostgreSQL and Tomcat, so these are included with an automatic install. You may choose to use your existing Tomcat/PostgreSQL or have the installer set them up as well.

Manual installation is also possible, as described here. At my first encounter with Jasper I installed Tomcat with the installer and used it for generating JasperReports, while the application I was working with was running WildFly (formerly JBoss), using a MySQL database. Needless to say, this was unnecessary, but I was not aware of that at the time. JasperServer can be configured to work with JBoss and MySQL as well.

The core of JasperReports is the JasperReports Library, which was already integrated into the project I was working with and is integrated into JasperReports Server as well as into popular IDEs, like TIBCO Jaspersoft Studio or iReport Designer.

Database

JasperReports provides example databases for imaginary companies, like FoodMart:

Foodmart

As we can see, these are normal tables, each having a primary key and some other fields.

.jrxml

jrxml, which stands for Jasper XML, contains report definitions in XML format. This type of file can be edited as code or visually to be compiled into .jasper files. The community provides samples that can be used and understood. Among them we can find a JFreeChart sample, which contains a jrxml file, a preview HTML and an actual PDF. The .jrxml file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE jasperReport PUBLIC "-//JasperReports//DTD Report Design//EN" "http://jasperreports.sourceforge.net/dtds/jasperreport.dtd">

<jasperReport name="JFreeChartReport" pageWidth="595" pageHeight="842" columnWidth="515" leftMargin="40" rightMargin="40" topMargin="50" bottomMargin="50" scriptletClass="JFreeChartScriptlet">
    <variable name="Chart" class="net.sf.jasperreports.engine.JRRenderable" calculation="System"/>
    <title>
        <band height="742">
            <line>
                <reportElement x="0" y="0" width="515" height="1"/>
                <graphicElement/>
            </line>
            <staticText>
                <reportElement x="0" y="10" width="515" height="30"/>
                <textElement textAlignment="Center">
                    <font size="22"/>
                </textElement>
                <text><![CDATA[JFreeChart Sample]]></text>
            </staticText>
            <textField>
                <reportElement x="0" y="50" width="515" height="50"/>
                <textElement textAlignment="Center">
                    <font size="12"/>
                </textElement>
                <textFieldExpression class="java.lang.String"><![CDATA["This sample uses JFreeChart Version 1.0.0-pre2\n" + "Written by David Gilbert (david.gilbert@object-refinery.com) and others.\n" + "(C)opyright 2000-2004, by Object Refinery Limited and Contributors."]]></textFieldExpression>
            </textField>
            <image scaleImage="Clip" hAlign="Center" hyperlinkType="Reference">
                <reportElement x="0" y="110" width="515" height="300"/>
                <graphicElement/>
                <imageExpression class="net.sf.jasperreports.engine.JRRenderable"><![CDATA[$V{Chart}]]></imageExpression>
                <hyperlinkReferenceExpression><![CDATA["http://www.jfree.org/jfreechart"]]></hyperlinkReferenceExpression>
            </image>
        </band>
    </title>
</jasperReport>

It starts with the xml tag, specifying that this file should be interpreted as XML. Then comes the DOCTYPE and finally the jasperReport node, which contains the actual report nodes. A variable is defined, called Chart, which is used later in the inner XML of the image node. A hyperlink is defined for the image. The preview for this report looks like this:

Preview

Don’t worry about the broken images; this is just the preview, the actual result looks like this:

Piechart

Nice, isn’t it?

Data source

It’s nice to generate reports, but in many cases the content is not fully known at programming time. It’s quite possible that we need to provide some input for the template. For this purpose the JRDataSource interface was defined: it is iterated with its .next() method and read via its .getFieldValue() method. To make sure that we can read fields, another interface, JRField, was defined as well. We will therefore need to use implementations of these interfaces, possibly writing our own if the available implementations don’t fulfill our needs.
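
As an illustration, a minimal custom data source could look something like the sketch below (the class name is invented for this example; JasperReports also ships ready-made implementations such as the JRBeanCollectionDataSource used later):

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import net.sf.jasperreports.engine.JRDataSource;
import net.sf.jasperreports.engine.JRException;
import net.sf.jasperreports.engine.JRField;

// Iterates over a list of maps keyed by field name, one map per report row.
public class MapListDataSource implements JRDataSource {

    private final Iterator<Map<String, Object>> iterator;
    private Map<String, Object> currentRecord;

    public MapListDataSource(List<Map<String, Object>> records) {
        this.iterator = records.iterator();
    }

    @Override
    public boolean next() throws JRException {
        if (!iterator.hasNext()) {
            return false;
        }
        currentRecord = iterator.next();
        return true;
    }

    @Override
    public Object getFieldValue(JRField field) throws JRException {
        // The field name matches the <field name="..."> declaration in the .jrxml
        return currentRecord.get(field.getName());
    }
}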

Let’s consider a datasource sample, also taken from the community. It has this .jrxml template:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE jasperReport PUBLIC "-//JasperReports//DTD Report Design//EN" "http://jasperreports.sourceforge.net/dtds/jasperreport.dtd">

<jasperReport name="DataSourceReport" pageWidth="595" pageHeight="842" columnWidth="515" leftMargin="40" rightMargin="40" topMargin="50" bottomMargin="50">
    <style name="Arial_Normal" isDefault="true" fontName="Arial" fontSize="12" isBold="false" isItalic="false" isUnderline="false" isStrikeThrough="false" pdfFontName="Helvetica" pdfEncoding="Cp1252" isPdfEmbedded="false"/>
    <style name="Arial_Bold" isDefault="false" fontName="Arial" fontSize="12" isBold="true" isItalic="false" isUnderline="false" isStrikeThrough="false" pdfFontName="Helvetica-Bold" pdfEncoding="Cp1252" isPdfEmbedded="false"/>
    <style name="Arial_Italic" isDefault="false" fontName="Arial" fontSize="12" isBold="false" isItalic="true" isUnderline="false" isStrikeThrough="false" pdfFontName="Helvetica-Oblique" pdfEncoding="Cp1252" isPdfEmbedded="false"/>
    <parameter name="ReportTitle" class="java.lang.String"/>
    <parameter name="DataFile" class="java.lang.String"/>
    <field name="id" class="java.lang.Integer"/>
    <field name="name" class="java.lang.String"/>
    <field name="street" class="java.lang.String"/>
    <field name="the_city" class="java.lang.String">
        <fieldDescription>me.me.city</fieldDescription>
    </field>
    <variable name="CityNumber" class="java.lang.Integer" incrementType="Group" incrementGroup="CityGroup" calculation="Count">
        <variableExpression><![CDATA[Boolean.TRUE]]></variableExpression>
    </variable>
    <group name="CityGroup" minHeightToStartNewPage="60">
        <groupExpression><![CDATA[$F{the_city}]]></groupExpression>
        <groupHeader>
        <band height="20">
            <textField evaluationTime="Group" evaluationGroup="CityGroup" bookmarkLevel="1">
                <reportElement mode="Opaque" x="0" y="5" width="515" height="15" backcolor="#c0c0c0" style="Arial_Bold"/>
                <box leftPadding="10" bottomBorder="1Point"/>
                <textFieldExpression class="java.lang.String"><![CDATA["  " + String.valueOf($V{CityNumber}) + ". " + String.valueOf($F{the_city})]]></textFieldExpression>
                <anchorNameExpression><![CDATA[String.valueOf($F{the_city})]]></anchorNameExpression>
            </textField>
        </band>
        </groupHeader>
        <groupFooter>
        <band height="20">
            <staticText>
                <reportElement x="400" y="1" width="60" height="15" style="Arial_Bold"/>
                <textElement textAlignment="Right"/>
                <text><![CDATA[Count :]]></text>
            </staticText>
            <textField>
                <reportElement x="460" y="1" width="30" height="15" style="Arial_Bold"/>
                <textElement textAlignment="Right"/>
                <textFieldExpression class="java.lang.Integer"><![CDATA[$V{CityGroup_COUNT}]]></textFieldExpression>
            </textField>
        </band>
        </groupFooter>
    </group>
    <title>
        <band height="70">
            <line>
                <reportElement x="0" y="0" width="515" height="1"/>
                <graphicElement/>
            </line>
            <textField isBlankWhenNull="true" bookmarkLevel="1">
                <reportElement x="0" y="10" width="515" height="30" style="Arial_Normal"/>
                <textElement textAlignment="Center">
                    <font size="22"/>
                </textElement>
                <textFieldExpression class="java.lang.String"><![CDATA[$P{ReportTitle}]]></textFieldExpression>
                <anchorNameExpression><![CDATA["Title"]]></anchorNameExpression>
            </textField>
            <textField isBlankWhenNull="true">
                <reportElement x="0" y="40" width="515" height="20" style="Arial_Normal"/>
                <textElement textAlignment="Center">
                    <font size="14"/>
                </textElement>
                <textFieldExpression class="java.lang.String"><![CDATA[$P{DataFile}]]></textFieldExpression>
            </textField>
        </band>
    </title>
    <pageHeader>
        <band height="20">
            <rectangle>
                <reportElement x="0" y="5" width="515" height="15" forecolor="#333333" backcolor="#333333"/>
                <graphicElement/>
            </rectangle>
            <staticText>
                <reportElement mode="Opaque" x="0" y="5" width="55" height="15" forecolor="#ffffff" backcolor="#333333" style="Arial_Bold"/>
                <textElement textAlignment="Center"/>
                <text><![CDATA[ID]]></text>
            </staticText>
            <staticText>
                <reportElement mode="Opaque" x="55" y="5" width="205" height="15" forecolor="#ffffff" backcolor="#333333" style="Arial_Bold"/>
                <text><![CDATA[Name]]></text>
            </staticText>
            <staticText>
                <reportElement mode="Opaque" x="260" y="5" width="255" height="15" forecolor="#ffffff" backcolor="#333333" style="Arial_Bold"/>
                <text><![CDATA[Street]]></text>
            </staticText>
        </band>
    </pageHeader>
    <detail>
        <band height="15">
            <textField bookmarkLevel="2">
                <reportElement x="0" y="0" width="50" height="15"/>
                <box leftBorder="Thin" bottomBorder="Thin" leftPadding="10" rightPadding="10"/>
                <textElement textAlignment="Right"/>
                <textFieldExpression class="java.lang.Integer"><![CDATA[$F{id}]]></textFieldExpression>
                <anchorNameExpression><![CDATA[$F{name} + " (" + $F{id} + ")"]]></anchorNameExpression>
            </textField>
            <textField isStretchWithOverflow="true">
                <reportElement positionType="Float" x="50" y="0" width="200" height="15"/>
                <box leftBorder="Thin" bottomBorder="Thin" leftPadding="10" rightPadding="10"/>
                <textElement/>
                <textFieldExpression class="java.lang.String"><![CDATA[$F{name}]]></textFieldExpression>
            </textField>
            <textField isStretchWithOverflow="true">
                <reportElement positionType="Float" x="250" y="0" width="265" height="15"/>
                <box leftBorder="Thin" bottomBorder="Thin" rightBorder="Thin" leftPadding="10" rightPadding="10"/>
                <textElement/>
                <textFieldExpression class="java.lang.String"><![CDATA[$F{street}]]></textFieldExpression>
            </textField>
        </band>
    </detail>
    <pageFooter>
        <band height="40">
            <line>
                <reportElement x="0" y="10" width="515" height="1"/>
                <graphicElement/>
            </line>
            <textField>
                <reportElement x="200" y="20" width="80" height="15"/>
                <textElement textAlignment="Right"/>
                <textFieldExpression class="java.lang.String"><![CDATA["Page " + String.valueOf($V{PAGE_NUMBER}) + " of"]]></textFieldExpression>
            </textField>
            <textField evaluationTime="Report">
                <reportElement x="280" y="20" width="75" height="15"/>
                <textElement/>
                <textFieldExpression class="java.lang.String"><![CDATA[" " + String.valueOf($V{PAGE_NUMBER})]]></textFieldExpression>
            </textField>
        </band>
    </pageFooter>
    <lastPageFooter>
        <band height="60">
            <textField bookmarkLevel="1">
                <reportElement x="0" y="10" width="515" height="15"/>
                <textElement textAlignment="Center"/>
                <textFieldExpression class="java.lang.String"><![CDATA["There were " +
                    String.valueOf($V{REPORT_COUNT}) +
                    " address records on this report."]]></textFieldExpression>
                <anchorNameExpression><![CDATA["Summary"]]></anchorNameExpression>
            </textField>
            <line>
                <reportElement x="0" y="30" width="515" height="1"/>
                <graphicElement/>
            </line>
            <textField>
                <reportElement x="200" y="40" width="80" height="15"/>
                <textElement textAlignment="Right"/>
                <textFieldExpression class="java.lang.String"><![CDATA["Page " + String.valueOf($V{PAGE_NUMBER}) + " of"]]></textFieldExpression>
            </textField>
            <textField evaluationTime="Report">
                <reportElement x="280" y="40" width="75" height="15"/>
                <textElement/>
                <textFieldExpression class="java.lang.String"><![CDATA[" " + String.valueOf($V{PAGE_NUMBER})]]></textFieldExpression>
            </textField>
        </band>
    </lastPageFooter>
</jasperReport>

As we can see, there are fields defined like id, name, street, and the_city. We also have a group called CityGroup, so when the items from the data source are iterated through, their group is known via the_city. It’s worth looking at how the paging works. The key is evaluationTime, which is telling the engine to not evaluate a given element at iteration time, but rather when an event occurs. evaluationTime="Report" means that we need to evaluate the value when the Report event occurs. At that time $V{PAGE_NUMBER} already has the value equal to the number of pages. Let’s see the preview:

Page1

Page2

Again, we don’t need to worry about the missing image icons, since this is only a preview and this is the actual result:

Prod page1

Prod page2

Since we have a few interfaces that we need to respect, we can easily integrate Hibernate with our Jasper reports as data source, we just need to make sure we are using the field and data source interfaces they defined. Here we have a few examples, notably the following:

List cats = session.createQuery("from eg.Cat").list();

Map parameters = new HashMap();
parameters.put("Title", "The Cat Report");

InputStream reportStream = this.getClass().getResourceAsStream("/the-cat-report.xml");
JasperDesign jasperDesign = JasperManager.loadXmlDesign(reportStream);
JasperReport jasperReport = JasperManager.compileReport(jasperDesign);

JRBeanCollectionDataSource ds = new JRBeanCollectionDataSource(cats);
JasperPrint jasperPrint = JasperManager.fillReport(jasperReport, parameters, ds);

JasperManager.printReportToPdfFile(jasperPrint, "the-cat-report.pdf");

We gather a List of Cat instances and define some parameters, like Title. Then we create an InputStream that is used to load a JasperDesign object, which in turn is compiled into a JasperReport. Next, we wrap cats in a data source and call fillReport, passing the jasperReport object we have just created, the parameters (which contain the title), and the data source. Finally, we print the report to a PDF. Note that you can also use a compiled .jasper file as input for getResourceAsStream.
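
For instance, using a pre-compiled report together with the current facade classes (JRLoader, JasperFillManager, and JasperExportManager), the same flow could look like the sketch below; the .jasper resource name is only an example:

// A sketch only: load an already-compiled .jasper resource instead of
// compiling the XML design at runtime.
InputStream reportStream = getClass().getResourceAsStream("/the-cat-report.jasper");
JasperReport jasperReport = (JasperReport) JRLoader.loadObject(reportStream);

Map parameters = new HashMap();
parameters.put("Title", "The Cat Report");

JRBeanCollectionDataSource ds = new JRBeanCollectionDataSource(cats);
JasperPrint jasperPrint = JasperFillManager.fillReport(jasperReport, parameters, ds);
JasperExportManager.exportReportToPdfFile(jasperPrint, "the-cat-report.pdf");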

TIBCO Jaspersoft Studio

This IDE is excellent for compiling the jrxml files into jasper files. I use it as a desktop application. There are three tabs:

  • Design
  • Source
  • Preview

In the Design tab one can interactively design the report, not worrying about source code, XML, and the like, which enables non-programmers to work on report creation as well. Handling the variables, parameters and fields needs some algorithmic understanding, but nonetheless, code is generated interactively in a point-and-click manner. The designer cannot do everything we need, but the Source tab comes to the rescue if needed. Sometimes I do not even need to use the Source tab.

There’s still one problem: testing is a high-cost operation in the project I am using Jasper for. We need to build the .jasper files, but that’s low-cost. The higher cost is that we need to build the application that actually generates the Jasper reports in order to test and then deploy it with JBoss. Then the actual test in the application occurs, so a test takes more than a minute. Luckily there is a Preview tab, where I can more or less see whether it’s a good idea to invest time into building and deploying, or if we need to do some tweaking first.

Long story short

For more information, see Jaspersoft’s website.

JasperReports is free and open source, although you have to pay a fee to consult the help documentation. It easily competes in quality with other reporting tools that are priced far less reasonably. When I first had to work with Jasper reports I didn't know anything about it, yet I was able to complete the tasks at hand in a few hours, which shows that getting into the Jasper universe isn't too hard. One may have difficulty understanding .jrxml files at first, but trust me, it's worth it.

If you are already using Java and want to generate reports, Jasper is a good candidate. It can generate reports for you periodically and even email them to you as attachments. I couldn't write better closing words than a poster I found:

Breaking news

* Domo plan pricing was found at yurbi.com.

Creating a Messaging App Using Spring for Apache Kafka, Part 2

Spring pasture

This article is part of a series.

In this part I’ll walk through Kafka’s servers and processes, the basics of spring-kafka producers and consumers, persistence, and caching configurations.

Kafka Servers

Kafka uses Apache ZooKeeper as its distributed coordination server. You can download Apache Kafka, which ships with ZooKeeper, from the Apache Kafka downloads page.

When you download and untar the Kafka bundle, Kafka's console scripts can be found in the bin directory. To prepare the Kafka setup, let's start the servers and see how to create Kafka topics and test console producers and consumers.

ZooKeeper

To start ZooKeeper with the default properties run the following command:

bin/zookeeper-server-start.sh config/zookeeper.properties

Kafka Server

A single Kafka server with the default properties can be started with the following command:

bin/kafka-server-start.sh config/server.properties

Kafka Topics

Creating Kafka Topics

Let’s create a test Kafka topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic myTestTopic

List Topics

To list all previously created Kafka topics:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Start a Producer

To start a console producer, run the following command and send some messages from the console:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic myTestTopic
> This is a message
> This is another message

Start a Consumer

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTestTopic --from-beginning

When you run the consumer on the console with the --from-beginning parameter, you'll see all previously sent messages printed to the console.

Here we ran Kafka as a single server. For production and large-scale distributed systems you'll need to optimize and scale the Kafka clusters. So far we've become familiar with some basic Kafka components; for further Kafka configuration refer to the official Kafka documentation.

spring-kafka Configuration

Consumer Configuration

In the Spring Boot project let’s put the lines below in the application.properties to configure the Spring Kafka consumer:

#Consumer
spring.kafka.consumer.bootstrap-servers=localhost:9092
spring.kafka.consumer.group-id=foo
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.consumer.key-deserializer=org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer=org.apache.kafka.common.serialization.StringDeserializer

A simple Kafka consumer is defined as a Spring @KafkaListener annotated method like this:

@Configuration
public class MyKafkaConsumer {

    @KafkaListener(topics = "myTestTopic")
    public void listenTopic(ConsumerRecord<String, String> kafkaMessage) {
        System.out.print(String.format("Received a message: %s", kafkaMessage.value()));
    }

}

We are going to define different Kafka consumer methods listening to different topics for different purposes in our messaging app.

Producer Configuration

For producer configuration let’s add the following lines in the application.properties in our Spring Kafka project.

#Producer
spring.kafka.producer.bootstrap-servers=localhost:9092
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.apache.kafka.common.serialization.StringSerializer

A very simple Kafka producer could be configured like below. Spring KafkaTemplate provides a producer model and methods for sending messages to specified Kafka topics.

@Configuration
public class MyKafkaProducer {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    public void sendMessage(String topic, String message) {
        System.out.println(String.format("Message is being sent to topic %s", topic));
        kafkaTemplate.send(topic, message);
    }

}
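
To give the producer a quick end-to-end try, a throwaway REST endpoint like the sketch below could publish to the test topic we created earlier. TestProducerController and its /test/send path are purely illustrative and not part of the project:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical controller, only for a manual smoke test of MyKafkaProducer.
@RestController
public class TestProducerController {

    @Autowired
    private MyKafkaProducer myKafkaProducer;

    // e.g. curl -X POST "http://localhost:8080/test/send?message=hello"
    @PostMapping("/test/send")
    public String send(@RequestParam String message) {
        myKafkaProducer.sendMessage("myTestTopic", message);
        return "sent";
    }
}

Any message posted this way should show up in the @KafkaListener method above, since it listens to the same topic.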

So far we have configured Kafka in a Spring Boot project and seen simple consumer and producer examples. Before going further with Kafka configuration, let’s configure the persistence and cache repositories.

Persistence Configuration

As I mentioned in the first part of this blog series, I'm going to use PostgreSQL as the persistence environment, and the Spring Data configuration in application.properties will look like this:

spring.datasource.url=jdbc:postgresql://localhost:5432/epmessagingdb
spring.datasource.username=epmessaging
spring.datasource.password=epmessagingdb_password
spring.datasource.driver-class-name=org.postgresql.Driver
spring.datasource.hikari.maximum-pool-size=30
spring.jpa.database-platform=PostgreSQL
# The SQL dialect makes Hibernate generate better SQL for the chosen database
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
# Hibernate ddl auto (create, create-drop, validate, update)
spring.jpa.hibernate.ddl-auto=none

In the properties we set the spring.jpa.hibernate.ddl-auto Spring JPA property to none to prevent Hibernate from populating the schema automatically; in some cases it can be useful to allow auto-population. We'll leave the base configuration as it is for now; in the next part we'll create our Spring Data models in the project.

Caching Configuration

I also mentioned that we’re going to use Redis as the cache environment. Redis is developed using C and a very fast in-memory cache.

Let’s put the following lines in application.properties to enable Redis configuration in our Spring Kafka project.

cache.redis.host=localhost
cache.redis.port=6379
cache.redis.timeout=5000
cache.redis.password=

Redis Pooling Factory

We’re going to use Jedis as the Redis client in our project. So let’s create a Jedis pooling factory class in our project called JedisFactory like below:

package com.endpoint.SpringKafkaMessaging.cache;

import javax.annotation.PostConstruct;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

@Component
public class JedisFactory {

    // @Value does not populate static fields, so we inject into instance
    // fields and build the shared pool once the bean has been constructed.
    @Value("${cache.redis.host}")
    private String host;

    @Value("${cache.redis.port}")
    private Integer port;

    @Value("${cache.redis.timeout}")
    private Integer timeout;

    @Value("${cache.redis.password}")
    private String password;

    private static JedisPool jedisPool;

    @PostConstruct
    private void init() {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(128);

        jedisPool = new JedisPool(
            poolConfig,
            host,
            port,
            timeout,
            // an empty password in application.properties means "no auth"
            (password == null || password.isEmpty()) ? null : password
        );
    }

    public static Jedis getConnection() {
        return jedisPool.getResource();
    }
}
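
With the pool in place, callers can borrow a connection using try-with-resources; a minimal smoke test (the key and value below are placeholders, and a local Redis is assumed to be running) looks like this:

// Borrow a connection from the pool, run a command, and return it automatically.
try (Jedis jedis = JedisFactory.getConnection()) {
    jedis.set("smoke-test-key", "it works");
    System.out.println(jedis.get("smoke-test-key"));
}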

We’ll create a persistence model, repository, controllers, and a cache repository in the next part of this blog series.

Shopify Admin API: Importing Products in Bulk

Cash Register

Photo by Chris Young, used under CC BY-SA 2.0, cropped from original.

I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.

My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.

Application Requirements

The main requirements for the Rails application were:

  • Retrieving product data for MTG cards by merging results from a combination of sources/APIs
  • Mapping card attributes and metadata into the format expected by the Shopify Admin API for creating Product records
  • Performing a bulk push of products to Shopify

There were some additional considerations like staying within rate limits for both the inventory data and Shopify APIs, but I will address those further in a follow-up post.

Retrieving Card Artwork and Pricing

I ended up using a combination of two APIs to retrieve MTG card data: MTGJSON for card details like the name of the card and the set it belonged to, and Scryfall for retrieving card images and current market pricing. It was relatively easy to combine the two because MTGJSON provided Scryfall IDs for all of its records, allowing me to merge results from the two APIs together.

Working With the Shopify Admin API in Ruby

The Shopify Admin API deals in terms of generic Product records with predefined attributes like title and product_type. The official shopify_api Ruby gem made it very easy to connect to my client’s Shopify store and create new products by creating Shopify::Product objects with a hash of attributes like so:

  attrs = {
    images: [{ src: scryfall_card.image_uris.large }],
    options: [
      {
        name: 'Card Type'
      },
      {
        name: 'Condition'
      }
    ],
    product_type: 'MTG Singles',
    tags: card.setCode,
    title: card.name,
    variants: [
      {
        inventory_management: 'shopify',
        inventory_quantity: 1,
        option1: 'Foil',
        option2: 'Like New',
        price: scryfall_card.prices.usd_foil
      }
    ]
  }

  Shopify::Product.new(attrs).save

The actual production code is a bit more complicated to account for outliers like cards with multiple “faces” and cards that come in both regular and foil variants, but the example above shows the basic shape of the attributes expected by the Shopify API.

Pushing 50,000 Products to Shopify

After I completed testing with individual products and confirmed the ability to take a specific card and turn it into a Shopify product with artwork and pricing pre-populated, it was time to perform the full upload of all 50,000+ cards in the MTGJSON database. I decided to use Sidekiq and create a job for each card upload so that I could throttle the workers to stay within the rate limits of both the Scryfall and Shopify APIs, and also have persistence that would allow me to pause/resume the queue or retry individual failed jobs.

The Sidekiq approach to queueing up all of the card uploads worked great; I was able to use the Sidekiq dashboard to monitor the queue of 50,000 jobs as it worked its way through each card, and was able to see the Shopify products being created on the store in real time. Once the inventory was in place in Shopify the store owner was able to start updating his inventory levels and make cards available for sale via the Shopify Admin.

Conclusion

A custom Ruby application using the Shopify API is a powerful solution for online storefronts that need to retrieve a large amount of inventory data from external sources. I was pleased with how this project turned out; it was rewarding to create a custom application that leveraged several APIs and automated a task that would have been extremely repetitive, and probably impossibly time-consuming, to do manually. It was encouraging to do my first upload of a card and see it show up on the Shopify store with artwork, pricing, and card details pre-populated.

The development model used for this project could be applied to stores in a wide variety of markets. This project used external APIs to retrieve product information but that data source could easily be replaced with a spreadsheet, CSV file, or some other export file containing bulk information on products to be sold.

Creating a Messaging App Using Spring for Apache Kafka, Part 3

Spring-Kafka

Photo by Pascal Debrunner on Unsplash

This article is part of a series.

In this article we’ll create the persistence and cache models and repositories. We’re also going to create our PostgreSQL database and the basic schema that we’re going to map to the persistence model.

Persistence

Database

We are going to keep the persistence model as simple as possible so we can focus on the overall functionality. Let’s first create our PostgreSQL database and schema. Here is the list of tables that we’re going to create:

  • users: will hold the users who are registered to use this messaging service.
  • access_token: will hold the unique authentication tokens per session. We’re not going to implement an authentication and authorization server specifically in this series but rather will generate a simple token and store it in this table.
  • contacts: will hold relationships of existing users.
  • messages: will hold messages sent to users.

Let’s create our tables:

CREATE TABLE kafkamessaging.users (
    user_id BIGSERIAL PRIMARY KEY,
    fname VARCHAR(32) NOT NULL,
    lname VARCHAR(32) NOT NULL,
    mobile VARCHAR(32) NOT NULL,
    created_at DATE NOT NULL
);

CREATE TABLE kafkamessaging.access_token (
    token_id BIGSERIAL PRIMARY KEY, 
    token VARCHAR(256) NOT NULL,
    user_id BIGINT NOT NULL REFERENCES kafkamessaging.users(user_id),
    created_at DATE NOT NULL
);

CREATE TABLE kafkamessaging.contacts (
    contact_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES kafkamessaging.users(user_id),
    contact_user_id BIGINT NOT NULL REFERENCES kafkamessaging.users(user_id)
);

CREATE TABLE kafkamessaging.messages (
    message_id BIGSERIAL PRIMARY KEY,
    from_user_id BIGINT NOT NULL REFERENCES kafkamessaging.users(user_id),
    to_user_id BIGINT NOT NULL REFERENCES kafkamessaging.users(user_id),
    message VARCHAR(512) NOT NULL,
    sent_at DATE NOT NULL
);

Model

Before creating the models we’ll add another dependency called Lombok in pom.xml as shown below. Lombok provides very helpful annotations which automatically create getters, setters and many other parts of a class.

<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
</dependency>

So here are the persistence model classes for the corresponding tables we created in the database. Notice the Lombok and javax.persistence annotations in the model classes:

User

package com.endpoint.SpringKafkaMessaging.persistent.model;

import java.io.Serializable;
import java.util.Date;
import java.util.Set;

import javax.persistence.CascadeType;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.OneToMany;
import javax.persistence.Table;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Table(name="users")
public class User implements Serializable {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(name="user_id")
    private Long userId;

    @Column
    private String fname;

    @Column
    private String lname;

    @Column
    private String mobile;

    @Column(name="created_at")
    private Date createdAt;

    @OneToMany(mappedBy = "user", fetch = FetchType.EAGER,
            cascade = CascadeType.ALL)
    private Set<Contact> contacts;

}

AccessToken

package com.endpoint.SpringKafkaMessaging.persistent.model;

import java.io.Serializable;
import java.util.Date;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Table(name="access_token")
public class AccessToken implements Serializable {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(name="token_id")
    private Long tokenId;

    @Column(name="token")
    private String token;

    @Column(name="user_id")
    private Long userId;

    @Column(name="created_at")
    private Date createdAt;

}

Contact

package com.endpoint.SpringKafkaMessaging.persistent.model;

import java.io.Serializable;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.Table;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Table(name="contacts")
public class Contact implements Serializable {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(name="contact_id")
    private Long contactId;

    @Column(name="user_id", insertable = false, updatable = false)
    private Long userId;

    @Column(name="contact_user_id")
    private Long contactUserId;

    @ManyToOne(fetch = FetchType.LAZY, optional = false)
    @JoinColumn(name = "user_id", nullable = false)
    private User user;
}

Message

package com.endpoint.SpringKafkaMessaging.persistent.model;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.io.Serializable;
import java.util.Date;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Table(name="messages")
public class Message implements Serializable {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(name="message_id")
    private Long messageId;

    @Column(name="from_user_id")
    private Long fromUserId;

    @Column(name="to_user_id")
    private Long toUserId;

    @Column(name="message")
    private String message;

    @Column(name="sent_at")
    private Date sentAt;

}

Note also that we didn’t use underscores in the class field names for the corresponding table field names like userId for user_id.

We’re going to use Spring’s CrudRepository interface to create our data repositories. CrudRepository can use keywords to automatically create logic using the given interface method names. Underscores are reserved characters, and even though you can still escape using double underscore in the CrudRepository method names, it doesn’t look good. I chose to use camel case, which also complies with Java convention.

Repository

Now let’s add the corresponding persistent repositories for each data model:

UserRepository

package com.endpoint.SpringKafkaMessaging.persistent.repository;

import java.util.List;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

import com.endpoint.SpringKafkaMessaging.persistent.model.User;

@Repository
public interface UserRepository extends CrudRepository<User, Long> {

    List<User> findAll();

    User findByUserId(Long userId);

    User findByMobile(String mobile);

    User findByFname(String fname);

    User findByLname(String lname);

    void deleteById(Long userId);

}
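
Just to show how these derived queries read at a call site, a throwaway helper could look like the sketch below; UserLookupExample and its lookup method are hypothetical and only for illustration:

package com.endpoint.SpringKafkaMessaging.persistent.repository;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.endpoint.SpringKafkaMessaging.persistent.model.User;

// Hypothetical helper, only to illustrate calling the derived queries.
@Component
public class UserLookupExample {

    @Autowired
    private UserRepository userRepository;

    public String lookupNameByMobile(String mobile) {
        // findByMobile is derived from the method name, roughly
        // "select u from User u where u.mobile = ?1"
        User user = userRepository.findByMobile(mobile);
        return user != null ? user.getFname() : null;
    }
}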

ContactRepository

package com.endpoint.SpringKafkaMessaging.persistent.repository;

import java.util.List;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

import com.endpoint.SpringKafkaMessaging.persistent.model.Contact;

@Repository
public interface ContactRepository extends CrudRepository<Contact, Long> {

    List<Contact> findAllByUserId(Long userId);

    Contact findByContactUserId(Long contactUserId);

    void deleteByContactUserId(Long contactUserId);
}

AccessTokenRepository

package com.endpoint.SpringKafkaMessaging.persistent.repository;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

import com.endpoint.SpringKafkaMessaging.persistent.model.AccessToken;

@Repository
public interface AccessTokenRepository extends CrudRepository<AccessToken, Long> {

    AccessToken findByUserId(Long userId);

    void deleteByUserId(Long userId);

}

MessageRepository

package com.endpoint.SpringKafkaMessaging.persistent.repository;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

import com.endpoint.SpringKafkaMessaging.persistent.model.Message;

@Repository
public interface MessageRepository extends CrudRepository<Message, Long> {

}

Cache

We’re not going to integrate the cache environment as Spring persistent data, so we won’t be using the CrudRepository implementation for the cache repository. Instead, let’s create the cache repository interface and create an implementation of it. Caching is going to be used for quick activation and authentication responses. To achieve this we’re going to store and query simple key-value pairs with Redis.

Repository

CacheRepository

package com.endpoint.SpringKafkaMessaging.cache.respository;

public interface CacheRepository {

    void putAccessToken(String token, String userId);

    String getUserIdByAccessToken(String token);

    void putActivationCode(String mobile, String activationCode);

    String queryMobileActivationCode(String mobile, String activationCode);
}

Since the business logic behind this interface is not created automatically by Spring Boot, we need to implement it ourselves in a Spring @Service, like below.

CacheRepositoryImpl

package com.endpoint.SpringKafkaMessaging.cache.respository;

import org.springframework.stereotype.Service;

import com.endpoint.SpringKafkaMessaging.cache.JedisFactory;

import redis.clients.jedis.Jedis;

@Service
public class CacheRepositoryImpl implements CacheRepository {

    @Override
    public void putAccessToken(String token, String userId) {

        try (Jedis jedis = JedisFactory.getConnection()) {

            jedis.set(token, userId);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public String getUserIdByAccessToken(String token) {

        try (Jedis jedis = JedisFactory.getConnection()) {

            return jedis.get(token);

        } catch (Exception e) {
            e.printStackTrace();
        }

        return null;
    }

    @Override
    public void putActivationCode(String mobile, String activationCode) {

        try (Jedis jedis = JedisFactory.getConnection()) {

            jedis.hset(mobile, String.valueOf(activationCode), "");
            jedis.expire(mobile, 15 * 60);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public String queryMobileActivationCode(String mobile, String code) {

        try (Jedis jedis = JedisFactory.getConnection()) {

            return jedis.hget(mobile, code);
        } catch (Exception e) {
            e.printStackTrace();
        }

        return null;
    }
}

Activation and Authentication

Activation is a one-time process that activates a mobile number for our messaging service client. After activation our simple authentication service will provide an access token to the messaging client, and this access token will be used for future client logins. To achieve these simple processes let's create our authentication service interface.

AuthService

package com.endpoint.SpringKafkaMessaging.auth;

public interface AuthService {
    void putAccessToken(String token, Long userId);

    void loginWithAccessToken(String mobile, String code);
}

AuthServiceImpl

package com.endpoint.SpringKafkaMessaging.auth;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.endpoint.SpringKafkaMessaging.cache.respository.CacheRepository;
import com.endpoint.SpringKafkaMessaging.persistent.model.AccessToken;
import com.endpoint.SpringKafkaMessaging.persistent.repository.AccessTokenRepository;

import java.util.Calendar;

@Service
public class AuthServiceImpl implements AuthService {

    @Autowired
    CacheRepository cacheRepository;

    @Autowired
    AccessTokenRepository accessTokenRepository;

    @Override
    public void putAccessToken(String token, Long userId) {

        // store token in the cache
        cacheRepository.putAccessToken(token, String.valueOf(userId));

        // store token in the persistence
        AccessToken accessToken = AccessToken.builder()
                                    .token(token)
                                    .userId(userId)
                                    .createdAt(Calendar.getInstance().getTime())
                                    .build();
        accessTokenRepository.save(accessToken);
    }

    @Override
    public void loginWithAccessToken(String mobile, String code) {
        // TODO
    }
}

We won’t implement a complex auth server here.

Let’s look at the draft form of the authentication controller below. The authentication controller here simulates the mobile number activation and one time login with the activation code and provides a unique access token to client. To achieve this I defined activation request and response models.

ActivationRequest

package com.endpoint.SpringKafkaMessaging.auth.controller;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Builder
@Data
@AllArgsConstructor
@NoArgsConstructor
public class ActivationRequest {

    private String mobile;

}

ActivationResponse

package com.endpoint.SpringKafkaMessaging.auth.controller;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Builder
@Data
@AllArgsConstructor
@NoArgsConstructor
public class ActivationResponse {

    private String mobile;

    private String activationCode;

}

LoginRequest

package com.endpoint.SpringKafkaMessaging.auth.controller;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Builder
@Data
@AllArgsConstructor
@NoArgsConstructor
public class LoginRequest {

    private String mobile;

    private String activationCode;

}

AuthController

package com.endpoint.SpringKafkaMessaging.auth.controller;

import javax.validation.Valid;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import com.endpoint.SpringKafkaMessaging.auth.AuthService;
import com.endpoint.SpringKafkaMessaging.cache.respository.CacheRepository;
import com.endpoint.SpringKafkaMessaging.persistent.model.User;
import com.endpoint.SpringKafkaMessaging.persistent.repository.UserRepository;

@RestController
@RequestMapping("/api/auth")
public class AuthController {

    @Autowired
    UserRepository userRepository;

    @Autowired
    AuthService authService;

    @Autowired
    CacheRepository cacheRepository;

    @RequestMapping(value = "/getcode", method = RequestMethod.POST, consumes = MediaType.APPLICATION_JSON_VALUE, produces = MediaType.APPLICATION_JSON_UTF8_VALUE)
    public ResponseEntity<Object> getCode(@Valid @RequestBody ActivationRequest activationRequest) {

        // TODO

        return null;
    }

    @RequestMapping(value = "/login", method = RequestMethod.POST, consumes = MediaType.APPLICATION_JSON_UTF8_VALUE, produces = MediaType.APPLICATION_JSON_UTF8_VALUE)
    public ResponseEntity<String> login(@RequestBody LoginRequest loginRequest) {

        // TODO

        return null;
    }
}
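
The endpoint bodies are left as TODOs on purpose and get completed in the next part. Just to make the intended flow concrete, one possible shape of the getCode body, using only the pieces defined so far, is sketched below; the random 4-digit code and returning it in the response (instead of sending an SMS) are illustrative assumptions, not the final implementation:

// A sketch of the getCode body inside AuthController, assuming a random 4-digit code.
String activationCode = String.valueOf(1000 + new java.util.Random().nextInt(9000));

// Cache the code for the later /login call; it expires in Redis after 15 minutes.
cacheRepository.putActivationCode(activationRequest.getMobile(), activationCode);

// In real life the code would be sent via SMS; here we simply return it.
return ResponseEntity.ok(
        ActivationResponse.builder()
                .mobile(activationRequest.getMobile())
                .activationCode(activationCode)
                .build());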

In the next chapter we’ll shape and complete the authentication service and controller and add message sender and receiver services. We’ll also configure and enable Spring WebSocket.

In the final chapter, we’ll create a simple web app interface as a messaging client to test our spring-kafka messaging application.

Designing Flexible CI pipelines with Jenkins and Docker

Pipes

Photo by Tian Kuan on Unsplash

When deciding on how to implement continuous integration (CI) for a new project, you are presented with lots of choices. Whatever you end up choosing, your CI needs to work for you and your team. Keeping the CI process and its mechanisms clear and concise helps everyone working on the project. The setup we are currently employing, and what I am going to showcase here, has proven to be flexible and powerful. Specifically, I’m going to highlight some of the things Jenkins and Docker do that are really helpful.

Jenkins

Jenkins provides us with all the CI functionality we need and it can be easily configured to connect to projects on GitHub and our internal GitLab. Jenkins has support for something it calls a multibranch pipeline. A Jenkins project follows a repo and builds any branch that has a Jenkinsfile. A Jenkinsfile configures an individual pipeline that Jenkins runs against a repo on a branch, tag or merge request (MR). To keep it even simpler, we condense the steps that a Jenkinsfile runs into shell scripts that live in /scripts/ at the root of the source repo to do things like test or build or deploy (/scripts/test.sh, etc). If a team member wants to know how the tests are run, it is right in the /scripts/test.sh file to reference.

The Jenkinsfile can be written in a declarative syntax or in plain Groovy. We have landed on the scripted Groovy syntax for its more fine-grained control of Docker containers. Jenkins also provides several ways to inspect and debug the pipelines, with things like “Replay” in its GUI and using input('wait here') in a pipeline to debug a troublesome step. The input() function is especially useful when paired with Docker. It allows us to pause the job and go to the Jenkins server, where we use docker ps to find the running container's name. Then we use docker exec -it {container name} bash to debug inside of the container with all of the Jenkins environment variables loaded. This has proven to be a great way to figure out why something isn't working in our test stages.

Docker

We love using Docker for our development and deployment for a variety of reasons. First, creating a Dockerfile for a project is essentially an exercise in figuring out how a project is built with a minimum of dependencies. Once a Docker container is built, the running container provides a great place to run tests as it is a clean checkout with little to no extra cruft. Using our Jenkins pipeline, we can take builds triggered by tags and push an associated tagged Docker image up to our registry. With Docker’s layering, pushes are often the shortest stage of the Jenkins job. Deploying that tag is as simple as doing a docker pull on the target system. For the application deployment, we create a basic docker-compose.yml to start and serve the project from within the container, forwarding whatever ports we need on the local system.

Example Jenkinsfile

Let’s take a look at a basic scripted Jenkinsfile (scripted in Groovy) that utilizes a Dockerfile in the source repo to build, test, and deploy a project:

node() {
  properties([gitLabConnection('gitlab-connect')])

  def vueImage
  def dockerTagName

  stage('Checkout') {
     checkout scm
  }

  stage('Build') {
    vueImage = docker.build("endpoint/vue-test")
  }
  vueImage.inside('-u 0') {
    stage('Test') {
      sh './scripts/test.sh'
    }
  }

  stage('Tag/Push') {
    docker.withRegistry('https://registry.hub.docker.com', 'ep_dockerhub_creds') {
      if (env.TAG_NAME != null) {
       vueImage.push("${env.TAG_NAME}")
      } else {
          vueImage.push("${env.BRANCH_NAME}")
      }
    }
  }
}

The script’s first stage, Checkout, checks out the repo using our gitlab-connect credentials that are stored on the Jenkins server. It then moves to the Build stage where it builds the image using the Dockerfile in our repo and names it after the org/repo it will use on DockerHub. Then, inside of the running container we enter the Test stage where we run the repo script ./scripts/test.sh. After the .inside code block is closed the running container is stopped and removed. Finally, we get to the Tag/Push stage where we push our Docker image up to DockerHub using another set of stored credentials. We tag it with either the TAG_NAME or the BRANCH_NAME.

This Jenkinsfile provides us with a solid base to expand on. During development, as requirements change, it's easy to modify and update the Jenkinsfile. We have the ability to run steps inside and outside of the Docker container. Combined with bash scripts that live in the repo, we can do almost anything. Most of the job mechanics can be tuned, down to the specific status updates GitLab receives during a run. Say we want to handle a push a bit differently if the branch is named Master, or we want to add another stage and break out the Test stage into Unit Tests and E2E Tests. These things are easily changed in the Jenkinsfile and then run on Jenkins when pushed. There's no need to merge to see the pipeline change. Every branch/tag/MR has its own pipeline. Deploying the Docker image you just built is easy; just use your TAG_NAME or BRANCH_NAME with docker pull endpoint/vue-test:{}.

Conclusion

Although the above script is just an example script, the Jenkinsfiles we use in production are not far off from this in functionality and the ideas remain the same. Jenkins is not the easiest to configure as some of the required functionality comes from plugins, and getting the correct combination of plugins can be a challenge. That being said, the functionality it provides paired with Docker is amazing and definitely worth considering when setting up CI for a new project.

Implementing SummAE neural text summarization with a denoising auto-encoder

Book open on lawn with dandelions

If there’s any problem space in machine learning, with no shortage of (unlabelled) data to train on, it’s easily natural language processing (NLP).

In this article, I’d like to take on the challenge of taking a paper that came from Google Research in late 2019 and implementing it. It’s going to be a fun trip into the world of neural text summarization. We’re going to go through the basics, the coding, and then we’ll look at what the results actually are in the end.

The paper we’re going to implement here is: Peter J. Liu, Yu-An Chung, Jie Ren (2019) SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders.

Here’s the paper’s abstract:

We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced.

Preliminaries

Before we go any further, let's talk a little bit about neural summarization in general. There are two main approaches to it: extractive and abstractive.

The first approach makes the model “focus” on the most important parts of the longer text - extracting them to form a summary.

Let’s take a recent article, “Shopify Admin API: Importing Products in Bulk”, by one of my great co-workers, Patrick Lewis, as an example and see what the extractive summarization would look like. Let’s take the first two paragraphs:

I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.

My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.

An extractive model could summarize it as follows:

I recently worked on an interesting project for a store owner who had an inventory of hundreds of thousands of cards that he wanted to sell through his store. The logistics and current pricing for each card made it impossible to do manually. My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card. The store launch turned out to be even more important than expected due to current closures of physical stores.

See how it does the copying and pasting? The big advantage of these types of models is that they are generally easier to create and the resulting summaries tend to faithfully reflect the facts included in the source.

The downside though is that it’s not how a human would do it. We do a lot of paraphrasing, for instance. We use different words and tend to form sentences less rigidly following the original ones. The need for the summaries to feel more natural made the second type — abstractive — into this subfield’s holy grail.

Datasets

The paper’s authors used the so-called “ROCStories” dataset (“Tackling The Story Ending Biases in The Story Cloze Test”. Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018).

In my experiments, I’ve also tried the model against one that’s quite a bit more difficult: WikiHow (Mahnaz Koupaee, William Yang Wang (2018) WikiHow: A Large Scale Text Summarization Dataset).

ROCStories

The dataset consists of 98162 stories, each one consisting of 5 sentences. It’s incredibly clean. The only step I needed to take was to split the stories between the train, eval, and test sets.

Examples of sentences:

Example 1:

My retired coworker turned 69 in July. I went net surfing to get her a gift. She loves Diana Ross. I got two newly released cds and mailed them to her. She sent me an email thanking me.

Example 2:

Tom alerted the government he expected a guest. When she didn’t come he got in a lot of trouble. They talked about revoking his doctor's license. And charging him a huge fee! Tom's life was destroyed because of his act of kindness.

Example 3:

I went to see the doctor when I knew it was bad. I hadn't eaten in nearly a week. I told him I felt afraid of food in my body. He told me I was developing an eating disorder. He instructed me to get some help.

Wikihow

This is one of the most challenging openly available datasets for neural summarization. It consists of more than 200,000 long-sequence pairs of text + headline scraped from WikiHow’s website.

Some examples:

Text:

One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you're brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly.

Headline:

Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you're not using it.

The main challenge for the summarization model here is that the headline was actually created by humans and is not just “extracting” anything. Any model performing well on this dataset actually needs to model the language pretty well. Otherwise, the headline could be used for computing the evaluation metrics, but it’s pretty clear that traditional metrics like ROUGE are just bound here to miss the point.

Basics of the sequence-to-sequence modeling

Most sequence-to-sequence models are based on the “next token prediction” workflow.

The general idea can be expressed with P(token | context) — where the task is to model this conditional probability distribution. The “context” here depends on the approach.

Those models are also called “auto-regressive” because they need to consume their own predictions from previous steps during the inference:

predict(["<start>"], context)
# "I"
predict(["<start>", "I"], context)
# "love"
predict(["<start>", "I", "love"], context)
# "biking"
predict(["<start>", "I", "love", "biking"], context)
# "<end>"

Naively simple modeling: Markov Model

In this model, the approach is to take on a bold assumption: that the probability of the next token is conditioned only on the previous token.

The Markov Model is elegantly introduced in the blog post Next Word Prediction using Markov Model.

Why is it naive? Because we know that the probability of the word “love” depends on the word “I” given a broader context. A model that’s always going to output “roses” would miss the best word more often than not.

Modeling with neural networks

Usually, sequence-to-sequence neural network models consist of two parts:

  • encoder
  • decoder

The encoder is there to build a “gist” representation of the input sequence. The gist and the previous token become our “context” to do the inference. This fits in well within the P(token | context) modeling I described above. That distribution can be expressed more clearly as P(token | previous; gist).

There are other approaches too with one of them being the ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training - 2020 - Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming. The difference in the approach here was the prediction of n-tokens ahead at once.

Teacher-forcing

Let’s see how could we go about teaching the model about the next token’s conditional distribution.

Imagine that the model’s parameters aren’t performing well yet. We have an input sequence of: ["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]. We’re training the model giving it the first token:

model(["<start>"], context)
# "I"

Great, now let’s ask it for another one:

model(["<start>", "I"], context)
# "wonder"

Hmmm that’s not what we wanted, but let’s naively continue:

model(["<start>", "I", "wonder"], context)
# "why"

We could continue gathering predictions and compute the loss at the end. The loss would really only be able to tell it about the first mistake (“love” vs. “wonder”); the rest of the errors would just accumulate from here. This would hinder the learning considerably, adding in the noise from the accumulated errors.

There’s a better approach called Teacher Forcing. In this approach, you’re telling the model the true answer after each of its guesses. The last example would look like the following:

model(["<start>", "I", "love"], context)
# "watching"

You’d continue the process, feeding it the full input sequence and the loss term would be computed based on all its guesses.

Compute-friendly representation for tokens and gists

Some of the readers might want to skip this section. I’d like to describe quickly here the concept of the latent space and vector embeddings. This is to keep the matters relatively palatable for the broader audience.

Representing words naively

How do we turn the words (strings) into numbers that we input into our machine learning models? A software developer might think about assigning each word a unique integer. This works well for databases but in machine learning models, the fact that integers follow one another means that they encode a relation (which one follows which and in what distance). This doesn’t work well for almost any problem in data science.

Traditionally, the problem is solved by “one-hot encoding”. This means that we’re turning our integers into vectors, where each value is zero except the one for the index that equals the value to encode (or minus one if your programming language uses zero-based indexing). Example: 3 => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] when the total number of “integers” (classes) to encode is 10.

This is better as it breaks the ordering and distancing assumptions. It doesn’t encode anything about the words, though, except the arbitrary number we’ve decided to assign to them. We now don’t have the ordering but we also don’t have any distance. Empirically though we just know that the word “love” is much closer to “enjoy” than it is to “helicopter”.

A better approach: word embeddings

How could we keep our vector representation (as in one-hot encoding) but also introduce the distance? I’ve already glanced over this concept in my post about the simple recommender system. The idea is to have a vector of floating-point values so that the closer the words are in their meaning, the smaller the angle is between them. We can easily compute a metric following this logic by measuring the cosine distance. This way, the word representations are easy to feed into the encoder, and they already contain a lot of the information in themselves.

Not only words

Can we only have vectors for words? Couldn’t we have vectors for paragraphs, so that the closer they are in their meaning, the smaller some vector space metric between them? Of course we can. This is, in fact, what will allow us in this article’s model to encode the “gist” that we talked about. The “encoder” part of the model is going to learn the most convenient way of turning the input sequence into the floating-point numbers vector.

Auto-encoders

We’re slowly approaching the model from the paper. We still have one concept that’s vital to understand in order to get why the model is going to work.

Up until now, we talked about the following structure of the typical sequence-to-sequence neural network model:

Sequence To Sequence Neural Nets

This is true e.g. for translation models where the input sequence is in English and the output is in Greek. It’s also true for this article’s model during the inference.

What if we made the input and output the same sequence? We'd turn it into a so-called auto-encoder.

The output of course isn’t all that useful — we already know what the input sequence is. The true value is in the model’s ability to encode the input into a gist.

Adding the noise

A very interesting type of an auto-encoder is the denoising auto-encoder. The idea is that the input sequence gets randomly corrupted and the network learns to still produce a good gist and reconstruct the sequence before it got corrupted. This makes the training “teach” the network about the deeper connections in the data, instead of just “memorizing” as much as it can.

The SummAE model

We’re now ready to talk about the architecture from the paper. Given what we’ve already learned, this is going to be very simple. The SummAE model is just a denoising auto-encoder that is being trained a special way.

Auto-encoding paragraphs and sentences

The authors were training the model on both single sentences and full paragraphs. In all cases the task was to reproduce the uncorrupted input.

The first part of the approach is about having two special “start tokens” to signal the mode: paragraph vs. sentence. In my code, I’ve used “<start-full>” and “<start-short>”.

During the training, the model learns the conditional distributions given those two tokens and the ones that follow, for any given token in the sequence.

Adding the noise

The sentences are simply concatenated to form a paragraph. The input then gets corrupted at random by means of:

  • masking the input tokens
  • shuffling the order of the sentences within the paragraph

The authors are claiming that the latter helped them in solving the issue of the network just memorizing the first sentence. What I have found though is that this model is generally prone towards memorizing concrete sentences from the paragraph. Sometimes it’s the first, and sometimes it’s some of the others. I’ve found this true even when adding a lot of noise to the input.

The code

The full PyTorch implementation described in this blog post is available at https://github.com/kamilc/neural-text-summarization. You may find some of its parts less clean than others — it’s a work in progress. Specifically, the data download is almost left out.

You can find the WikiData preprocessing in a notebook in the repository. For the ROCStories, I just downloaded the CSV files and concatenated with Unix cat. There’s an additional process.py file generated from a very simple IPython session.

Let’s have a very brief look at some of the most interesting parts of the code:

class SummarizeNet(NNModel):
    def encode(self, embeddings, lengths):
        # ...

    def decode(self, embeddings, encoded, lengths, modes):
        # ...

    def forward(self, embeddings, clean_embeddings, lengths, modes):
        # ...

    def predict(self, vocabulary, embeddings, lengths):
        # ...

You can notice separate methods for forward and predict. I chose the Transformer over the recurrent neural networks for both the encoder part and the decoder. The PyTorch implementation of the transformer decoder part already includes the teacher forcing in the forward method. This makes it convenient at the training time — to just feed it the full, uncorrupted sequence of embeddings as the “target”. During the inference we need to do the “auto-regressive” part by hand though. This means feeding the previous predictions in a loop — hence the need for two distinct methods here.

def forward(self, embeddings, clean_embeddings, lengths, modes):
    noisy_embeddings = self.mask_dropout(embeddings, lengths)

    encoded = self.encode(noisy_embeddings[:, 1:, :], lengths-1)
    decoded = self.decode(clean_embeddings, encoded, lengths, modes)

    return (
        decoded,
        encoded
    )

You can notice that I’m doing the token masking at the model level during the training. The code also shows cleanly the structure of this seq2seq model — with the encoder and the decoder.

The encoder part looks simple as long as you’re familiar with the transformers:

def encode(self, embeddings, lengths):
    batch_size, seq_len, _ = embeddings.shape

    embeddings = self.encode_positions(embeddings)

    paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=0).expand((batch_size, seq_len)).to(self.device)
    paddings_mask = (paddings_mask + 1) > lengths.unsqueeze(dim=1).expand((batch_size, seq_len))

    encoded = embeddings.transpose(1,0)

    for ix, encoder in enumerate(self.encoders):
        encoded = encoder(encoded, src_key_padding_mask=paddings_mask)
        encoded = self.encode_batch_norms[ix](encoded.transpose(2,1)).transpose(2,1)

    last_encoded = encoded

    encoded = self.pool_encoded(encoded, lengths)

    encoded = self.to_hidden(encoded)

    return encoded

We’re first encoding the positions as in the “Attention Is All You Need” paper and then feeding the embeddings into a stack of the encoder layers. At the end, we’re morphing the tensor to have the final dimension equal the number given as the model’s parameter.

The decode sits on PyTorch’s shoulders too:

def decode(self, embeddings, encoded, lengths, modes):
    batch_size, seq_len, _ = embeddings.shape

    embeddings = self.encode_positions(embeddings)

    mask = self.mask_for(embeddings)

    encoded = self.from_hidden(encoded)
    encoded = encoded.unsqueeze(dim=0).expand(seq_len, batch_size, -1)

    decoded = embeddings.transpose(1,0)
    decoded = torch.cat(
        [
            encoded,
            decoded
        ],
        axis=2
    )
    decoded = self.combine_decoded(decoded)
    decoded = self.combine_batch_norm(decoded.transpose(2,1)).transpose(2,1)

    paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=0).expand((batch_size, seq_len)).to(self.device)
    paddings_mask = paddings_mask > lengths.unsqueeze(dim=1).expand((batch_size, seq_len))

    for ix, decoder in enumerate(self.decoders):
        decoded = decoder(
            decoded,
            torch.ones_like(decoded),
            tgt_mask=mask,
            tgt_key_padding_mask=paddings_mask
        )
        decoded = self.decode_batch_norms[ix](decoded.transpose(2,1)).transpose(2,1)

    decoded = decoded.transpose(1,0)# [:, 0:(decoded.shape[0] - 1), :]

    return self.linear_logits(decoded)

Notice that I'm combining the gist received from the encoder with each word's embedding, as described in the paper.

The predict method is very similar to forward:

def predict(self, vocabulary, embeddings, lengths):
    """
    Caller should include the start and end tokens here
    but we're going to ensure the start one is replaced by <start-short>
    """
    previous_mode = self.training

    self.eval()

    batch_size, _, _ = embeddings.shape

    results = []

    for row in range(0, batch_size):
        row_embeddings = embeddings[row, :, :].unsqueeze(dim=0)
        row_embeddings[0, 0] = vocabulary.token_vector("<start-short>")

        encoded = self.encode(
            row_embeddings[:, 1:, :],
            lengths[row].unsqueeze(dim=0)
        )

        results.append(
            self.decode_prediction(
                vocabulary,
                encoded,
                lengths[row].unsqueeze(dim=0)
            )
        )

    self.training = previous_mode

    return results

The workhorse behind decoding at inference time looks as follows:

def decode_prediction(self, vocabulary, encoded1xH, lengths1x):
    tokens = ['<start-short>']
    last_token = None
    seq_len = 1

    encoded1xH = self.from_hidden(encoded1xH)

    while last_token != '<end>' and seq_len < 50:
        embeddings1xSxD = vocabulary.embed(tokens).unsqueeze(dim=0).to(self.device)
        embeddings1xSxD = self.encode_positions(embeddings1xSxD)

        maskSxS = self.mask_for(embeddings1xSxD)

        encodedSx1xH = encoded1xH.unsqueeze(dim=0).expand(seq_len, 1, -1)

        decodedSx1xD = embeddings1xSxD.transpose(1,0)
        decodedSx1xD = torch.cat(
            [
                encodedSx1xH,
                decodedSx1xD
            ],
            axis=2
        )
        decodedSx1xD = self.combine_decoded(decodedSx1xD)
        decodedSx1xD = self.combine_batch_norm(decodedSx1xD.transpose(2,1)).transpose(2,1)

        for ix, decoder in enumerate(self.decoders):
            decodedSx1xD = decoder(
                decodedSx1xD,
                torch.ones_like(decodedSx1xD),
                tgt_mask=maskSxS,
            )
            decodedSx1xD = self.decode_batch_norms[ix](decodedSx1xD.transpose(2,1))
            decodedSx1xD = decodedSx1xD.transpose(2,1)

        decoded1x1xD = decodedSx1xD.transpose(1,0)[:, (seq_len-1):seq_len, :]
        decoded1x1xV = self.linear_logits(decoded1x1xD)

        word_id = F.softmax(decoded1x1xV[0, 0, :]).argmax().cpu().item()
        last_token = vocabulary.words[word_id]
        tokens.append(last_token)
        seq_len += 1

    return ' '.join(tokens[1:])

Notice that we start with the “start short” token and then go in a loop, getting a prediction and feeding it back, until the “end” token appears (or the length limit is reached).

Again, the model is very, very simple. What makes the difference is how it’s being trained — it’s all in the training data corruption and the model pre-training.

This is already a long article, so I encourage curious readers to look at the code in my GitHub repo for more details.

My experiment with the WikiHow dataset

In my WikiHow experiment I wanted to see what the results would look like if I fed the full articles and their headlines as the two modes of the network. The same data-corruption regime was used in this case.

Some of the results looked almost good:

Text:

for a savory flavor, mix in 1/2 teaspoon ground cumin, ground turmeric, or masala powder.this works best when added to the traditional salty lassi. for a flavorful addition to the traditional sweet lassi, add 1/2 teaspoon of ground cardamom powder or ginger, for some kick. , start with a traditional sweet lassi and blend in some of your favorite fruits. consider mixing in strawberries, papaya, bananas, or coconut.try chopping and freezing the fruit before blending it into the lassi. this will make your drink colder and frothier. , while most lassi drinks are yogurt based, you can swap out the yogurt and water or milk for coconut milk. this will give a slightly tropical flavor to the drink. or you could flavor the lassi with rose water syrup, vanilla extract, or honey.don’t choose too many flavors or they could make the drink too sweet. if you stick to one or two flavors, they’ll be more pronounced. , top your lassi with any of the following for extra flavor and a more polished look: chopped pistachios sprigs of mint sprinkle of turmeric or cumin chopped almonds fruit sliver

Headline:

add a spice., blend in a fruit., flavor with a syrup or milk., garnish.

Predicted summary:

blend vanilla in a sweeter flavor . , add a sugary fruit . , do a spicy twist . eat with dessert . , revise .

It’s not 100% faithful to the original text even though it seems to “read” well.

My suspicion is that pre-training against a much larger corpus of text might help. There's an obvious issue here: the network lacks the very specific knowledge it would need to summarize better. Here's another of those examples:

Text:

the settings app looks like a gray gear icon on your iphone's home screen.; , this option is listed next to a blue "a" icon below general. , this option will be at the bottom of the display & brightness menu. , the right-hand side of the slider will give you bigger font size in all menus and apps that support dynamic type, including the mail app. you can preview the corresponding text size by looking at the menu texts located above and below the text size slider. , the left-hand side of the slider will make all dynamic type text smaller, including all menus and mailboxes in the mail app. , tap the back button twice in the upper-left corner of your screen. it will save your text size settings and take you back to your settings menu. , this option is listed next to a gray gear icon above display & brightness. , it's halfway through the general menu. ,, the switch will turn green. the text size slider below the switch will allow for even bigger fonts. , the text size in all menus and apps that support dynamic type will increase as you go towards the right-hand side of the slider. this is the largest text size you can get on an iphone. , it will save your settings.

Headline:

open your iphone's settings., scroll down and tap display & brightness., tap text size., tap and drag the slider to the right for bigger text., tap and drag the slider to the left for smaller text., go back to the settings menu., tap general., tap accessibility., tap larger text. , slide the larger accessibility sizes switch to on position., tap and drag the slider to the right., tap the back button in the upper-left corner.

Predicted summary:

open your iphone 's settings . , tap general . , scroll down and tap accessibility . , tap larger accessibility . , tap and larger text for the iphone to highlight the text you want to close . , tap the larger text - colored contacts app .

It might be interesting to train against this dataset again while:

  • utilizing a pre-trained, large-scale model as part of the encoder
  • using a large corpus of text to pre-train the auto-encoder

This would likely take a lot of time to train on my GPU (even with the pre-trained part of the encoder), so I didn't pursue the idea further at this time.

The problem with getting paragraphs when we want sentences

One of the biggest problems the authors ran into was the decoder outputting the long version of the text even when it was asked for the sentence-long summary.

The authors called this phenomenon the “segregation issue”. What they found was that the encoder was mapping paragraphs and sentences into completely separate regions of the latent space. The solution was to trick the encoder into making both representations indistinguishable. The following figure comes from the paper and visualizes the issue:

Segregation problem

Better gists by using the “critic”

The idea of a “critic” was popularized along with the fantastic results produced by some Generative Adversarial Networks. The general workflow is to have the main network generate output while another network tries to guess some of its properties.

For GANs that are generating realistic photos, the critic is there to guess if the photo was generated or if it’s real. A loss term is added based on how well it’s doing, penalizing the main network for generating photos that the critic is able to call out as fake.

A similar idea was used in the A3C algorithm I blogged about (Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm). The “critic” part penalized the AI agent for taking steps that were on average less advantageous.

Here, in the SummAE model, the critic adds a penalty to the loss proportional to how well it can guess whether the gist comes from a paragraph or a sentence.

Training with the critic can get tricky. What I've found to be the cleanest way is to use two different optimizers: one updating the main network's parameters and the other updating the critic itself:

for batch in batches:
    if mode == "train":
        self.model.train()
        self.discriminator.train()
    else:
        self.model.eval()
        self.discriminator.eval()

    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()

    logits, state = self.model(
        batch.word_embeddings.to(self.device),
        batch.clean_word_embeddings.to(self.device),
        batch.lengths.to(self.device),
        batch.mode.to(self.device)
    )

    mode_probs_disc = self.discriminator(state.detach())
    mode_probs = self.discriminator(state)

    discriminator_loss = F.binary_cross_entropy(
        mode_probs_disc,
        batch.mode
    )

    discriminator_loss.backward(retain_graph=True)

    if mode == "train":
        self.discriminator_optimizer.step()

    text = batch.text.copy()

    if self.no_period_trick:
        text = [txt.replace('.', '') for txt in text]

    classes = self.vocabulary.encode(text, modes=batch.mode)
    classes = classes.roll(-1, dims=1)
    classes[:,classes.shape[1]-1] = 3

    model_loss = torch.tensor(0).cuda()

    if logits.shape[0:2] == classes.shape:
        model_loss = F.cross_entropy(
            logits.reshape(-1, logits.shape[2]).to(self.device),
            classes.long().reshape(-1).to(self.device),
            ignore_index=3
        )
    else:
        print("WARNING: Skipping model loss for inconsistency between logits and classes shapes")

    fooling_loss = F.binary_cross_entropy(
        mode_probs,
        torch.ones_like(batch.mode).to(self.device)
    )

    loss = model_loss + (0.1 * fooling_loss)

    loss.backward()
    if mode == "train":
        self.optimizer.step()

    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()

The main idea is to treat the main network's encoded gist as a constant when updating the critic's parameters, and, conversely, to leave the critic's parameters alone when updating the main network.
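
To make that concrete, here's a tiny, self-contained PyTorch illustration (not taken from the SummAE code; the tensor x merely stands in for the encoder's gist):

import torch

# x stands in for the gist produced by the main network's encoder
x = torch.randn(4, 8, requires_grad=True)
critic = torch.nn.Linear(8, 1)

# Critic-style loss: detach() blocks gradients from flowing back into x,
# so the "encoder" is treated as a constant for this update
critic_loss = critic(x.detach()).mean()
critic_loss.backward()
print(x.grad)  # None -- nothing propagated back to the encoder

# "Fooling"-style loss: gradients do flow back into x; the critic's own
# optimizer simply never takes a step for this loss
fooling_loss = critic(x).mean()
fooling_loss.backward()
print(x.grad is not None)  # True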

Results

I've found that some of the results look really exceptional:

Text:

lynn is unhappy in her marriage. her husband is never good to her and shows her no attention. one evening lynn tells her husband she is going out with her friends. she really goes out with a man from work and has a great time. lynn continues dating him and starts having an affair.

Predicted summary:

lynn starts dating him and has an affair .

Text:

cedric was hoping to get a big bonus at work. he had worked hard at the office all year. cedric's boss called him into his office. cedric was disappointed when told there would be no bonus. cedric's boss surprised cedric with a big raise instead of a bonus.

Predicted summary:

cedric had a big deal at his boss 's office .

Some others showed how the model attends to single sentences though:

Text:

i lost my job. i was having trouble affording my necessities. i didn't have enough money to pay rent. i searched online for money making opportunities. i discovered amazon mechanical turk.

Predicted summary:

i did n't have enough money to pay rent .

While a sentence like this one might make a good headline, it's definitely not the best summary, as it naturally loses the vital parts found in the other sentences.

Final words

First of all, let me thank the paper’s authors for their exceptional work. It was a great read and great fun implementing!

Abstractive text summarization remains very difficult. The model trained for this blog post has very limited use in practice. There’s a lot of room for improvement though, which makes the future of abstractive summaries very promising.


Testing to defend against nginx add_header surprises


Cute calico cat perched securely upon a trepidatious shoe

These days when hosting websites it is common to configure the web server to send several HTTP response headers with every single request for security purposes.

For example, using the nginx web server we may add these directives to our http configuration scope to apply to everything served, or to specific server configuration scopes to apply only to particular websites we serve:

add_header Strict-Transport-Security max-age=2592000 always;
add_header X-Content-Type-Options    nosniff         always;

(See HTTP Strict Transport Security and X-Content-Type-Options at MDN for details about these two particular headers.)

The surprise (problem)

Once upon a time I ran into a case where nginx usually added the expected HTTP response headers, but later appeared to be inconsistent and sometimes did not. This is distressing!

Troubleshooting led to the (re-)discovery that add_header directives are not always additive throughout the configuration, as one would expect and as every other server I can think of behaves.

If you define your add_header directives in the http block and then use an add_header directive in a server block, those from the http block will disappear.

If you define some add_header directives in the server block and then add another add_header directive in a location block, those from the http and/or server blocks will disappear.

This is even the case in an if block.
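
As a simplified illustration (the location and header values here are hypothetical), a configuration like the following will not send the nosniff header for requests handled by the /downloads/ location, because that block declares its own add_header:

http {
    add_header X-Content-Type-Options nosniff always;

    server {
        # This server block inherits the nosniff header from the http block...

        location /downloads/ {
            # ...but this block has its own add_header, so the inherited
            # header above is no longer sent for these responses.
            add_header Content-Disposition attachment always;
        }
    }
}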

In the nginx add_header documentation we find the reason for the behavior explained:

There could be several add_header directives. These directives are inherited from the previous level if and only if there are no add_header directives defined on the current level.

This nginx directive has always behaved this way. Various people have warned about it in blog posts and online discussions for many years. But the situation remains the same, a trap for the unwary.

I have tried to imagine the rationale behind this behavior. Response headers are often set in groups, so the programmer who created this feature may have decided that any new scope's add_header directives should start with a clean slate, unaffected by those set elsewhere. Hmm. The need for exclusive grouping of response headers is rare in my experience, and adding headers to the existing stack of tentative response headers is far more commonly what I want.

So while this behavior may make sense somewhere, it has not ever done so for me or anyone I have talked to about it. For us it is simply misbehavior, silent and easy to overlook when making later seemingly unrelated configuration adjustments.

Dangers

It often has security implications when headers you thought were being added to every response are not. Consider more fine-tuned and consequential security-related headers such as Content-Security-Policy, Vary for cache object separation, CORS headers Access-Control-*, etc.

Headers such as these are especially important when they need to be added based on logic spread across various configuration blocks, and that is exactly when nginx add_header doesn't work as expected.

Another pitfall is omitting the always option to add_header. Without that, the header will only be added to success responses (2XX and 3XX, but see the docs for specifics). We usually want security-related headers to be added even to 4XX and 5XX error responses.

Workaround using include

My first instinct was to work around the problems caused by this behavior by putting the standard add_header list in a file that I include everywhere. In some cases that works.
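
For example (the snippet path and location shown are hypothetical), the shared headers can live in one file that gets re-included in any scope that declares its own add_header:

# /etc/nginx/snippets/security-headers.conf
add_header Strict-Transport-Security max-age=2592000 always;
add_header X-Content-Type-Options    nosniff         always;

# ...and wherever another add_header would otherwise wipe them out:
location /downloads/ {
    add_header Content-Disposition attachment always;
    include /etc/nginx/snippets/security-headers.conf;
}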

But despite the nginx include documentation saying that directive is allowed in “Context: any”, include is not allowed in an if block and will result in the fatal startup error:

"include" directive is not allowed here

So the only recourse in those cases is to repeat all needed add_header directives in every if block that uses add_header. Gross.

Repeating configuration manually almost guarantees that the add_header directives in different configuration areas will drift apart over time. So if we have to repeat ourselves, at least let's do it with automation, such as configuration templating and preprocessing.

That is what I have most recently done. And we can still use native nginx include directives everywhere those are allowed.

nginx Headers More module

Many people have run into exactly this problem, and some of them developed a separate nginx module, ngx_headers_more, to solve most of these issues.

By using its more_set_headers directive, you get the expected additive behavior with previously-declared headers, regardless of the block scope:

Directives inherited from an upper level scope (say, http block or server blocks) are executed before the directives in the location block.

Note that although more_set_headers is allowed in location if blocks, it is not allowed in the server if blocks …

Fortunately I have not needed to use this in an if block in the server scope, so that one remaining limitation doesn’t pose a problem for me.

It also has options to set a header only for responses of a certain HTTP content type or status code.

The more_clear_headers directive allows the * wildcard for clearing all headers with the same prefix at once, such as Access-Control-*.
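
For a quick taste of the module's syntax (the header names and values here are only examples, not recommendations):

# Additive: keeps headers set at higher levels
more_set_headers "X-Content-Type-Options: nosniff";

# Only for certain response statuses or content types
more_set_headers -s "404 500" "X-Error-Page: static";
more_set_headers -t "text/html" "X-Frame-Options: SAMEORIGIN";

# Clear a whole family of headers with a wildcard
more_clear_headers "Access-Control-*";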

Installing ngx_headers_more

Because “Headers More” is a separate module, not part of standard nginx, it is not usually available without some extra work.

You can build it from source and install it manually, but of course that isn’t good to do on a production machine since it won’t get updated on its own.

You can use the OpenResty server built around nginx, which “Headers More” is part of. But you may not want all of that if you’re not writing a Lua web application.

Many Linux distributions and 3rd-party package repositories have prebuilt packages for “Headers More” which you can use:

  • Alpine
    • nginx-mod-http-headers-more
  • Debian & Ubuntu
    • nginx-extras
    • libnginx-mod-http-headers-more-filter
  • RHEL/CentOS
    • GetPageSpeed & Webtatic repos nginx-module-headers-more
    • Aeris repo nginx-more

Search the excellent pkgs.org to find what you need if it isn’t already available through your package manager.

Apache

Apache httpd is still alive and well — actually better than ever. So depending on your situation, you may want to use that instead.

Apache’s Header directive has intuitive (to me) default behavior for setting response headers across the whole configuration, and many ways to deal with a possibly already-existing header:

  • add another header, or set exclusively (replace), or set only if this header doesn’t already exist
  • append to or merge into an existing header (for headers that accept multiple values)
  • edit an existing header with a regular expression search-and-replace
  • unset a header if one was previously set

I don’t know a way to have Apache clear a group of headers with a wildcard, or all headers at once, so they need to be individually cleared by name if that’s what you want.
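
For illustration (header names and values are examples, not a recommendation), those variants look like this in Apache configuration:

# Set unconditionally, even on 4XX/5XX responses
Header always set X-Content-Type-Options "nosniff"
# Set only if the header doesn't already exist
Header setifempty Strict-Transport-Security "max-age=2592000"
# Merge a value into an existing header, avoiding duplicates
Header merge Vary "Accept-Encoding"
# Edit an existing header with a regex search-and-replace
Header edit Set-Cookie "^(.*)$" "$1; HttpOnly; Secure"
# Remove a header that was set earlier
Header unset X-Powered-By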

Доверяй, но проверяй (Trust, but verify)

nginx was written by Igor Sysoev. Despite my disagreement with this one feature’s behavior, overall I find that nginx is excellent. Because of its open source release, excellent performance, and wide use, it has provided much-needed competition to Apache and Microsoft IIS. Thank you, Igor and all other contributors!

In the relevant spirit, since Igor is Russian, I close with the Russian proverb Доверяй, но проверяй: Trust, but verify.

Let us code (and configure) defensively, yet also test to avoid being surprised by missing headers.

We can manually test that various HTTP responses are as we expect, using curl -v or other HTTP clients to exercise various requests.
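
For instance (the hostname and path are placeholders), a quick spot check might look like this, since curl -v prints response headers to stderr prefixed with "< ":

curl -sv -o /dev/null https://your.dom.ain/robots.txt 2>&1 \
  | grep -iE '^< (strict-transport-security|x-content-type-options)'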

Even better, we can add to our automated test suite to confirm these HTTP response headers appear everywhere we expect, for static files and API endpoints backed by different application servers, and for various success and error responses.

Here is a test adapted from one I put together for one of our clients. It uses JavaScript in Node.js, the Jest test framework, and the Axios HTTP client. It ensures the security headers example I showed at the beginning of this article keeps working, even as we make nginx configuration changes over time:

const axios = require('axios');

const http = axios.create({
  baseURL: 'https://your.dom.ain',
});

describe('Check security headers', () => {
  const verifs = [
    { header: 'strict-transport-security', expect: (x) => x.toMatch(/max-age=\d{3,}/) },
    { header: 'x-content-type-options',    expect: (x) => x.toEqual('nosniff')        },
  ];

  const locs = [
    { path: '/robots.txt',                status: 200 },  // static
    { path: '/feed/endpoint/of/interest', status: 200 },  // API backend in PHP
    { path: '/api/other/auth/endpoint',   status: 403 },  // API backend in Perl
    { path: '/never/gonna/give/you/up!',  status: 404 },
    { path: '/api/dies/for/testing',      status: 500 },
  ];

  // throw no exceptions for non-success HTTP response status
  const conf = { validateStatus: () => true };

  for (const l of locs) {
    test(`${l.status} ${l.path}`, async () => {
      const res = await http.get(l.path, conf);
      expect(res.status).toBe(l.status);
      for (const v of verifs) {
        v.expect(expect(res.headers[v.header]));
      }
    });
  }
});

Here I run just this one test rather than the whole suite:

% jest -w 6 ./__tests__/webserver/security-headers.test.js
Determining test suites to run...
testing on https://your.dom.ain

 PASS  webserver/security-headers.test.js
  Check security headers
    ✓ 200 /robots.txt (55ms)
    ✓ 200 /feed/endpoint/of/interest (408ms)
    ✓ 403 /api/other/auth/endpoint (18ms)
    ✓ 404 /never/gonna/give/you/up! (6ms)
    ✓ 500 /api/dies/for/testing (12ms)

Test Suites: 1 passed, 1 total
Tests:       5 passed, 5 total
Snapshots:   0 total
Time:        2.721s, estimated 3s
Ran all test suites matching /.\/__tests__\/webserver\/security-headers.test.js/i.

This can also be extended to ensure that certain headers do not exist, or do not contain details that you do not want exposed:

  • the Server header should not reveal the Apache version number — see the ServerTokens directive
  • the X-Powered-By header should be absent, so as not to expose the fact that you are using PHP, nor the PHP version number — see the expose_php directive for php.ini
  • or with the Java Wildfly server, both of those headers are sent by default! — see instructions on how to omit them by editing XML or using jboss-cli
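
As a rough sketch of how the Jest suite above could check for that (reusing the http and conf objects defined earlier; the exact headers and expectations are up to you):

// Hypothetical extra test inside the same describe block as above
test('no version information leaks on /robots.txt', async () => {
  const res = await http.get('/robots.txt', conf);
  // PHP should not announce itself at all
  expect(res.headers['x-powered-by']).toBeUndefined();
  // A Server header may exist, but should not contain a version number
  expect(res.headers['server'] || '').not.toMatch(/\d/);
});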

Now what if I forget about the nginx add_header behavior, make changes, and inadvertently break things? Instead of it going unnoticed, my test suite will alert me so I can fix it before it goes into production!

Why Upgrading Software Libraries is Imperative



Image by Tolu Olubode on Unsplash

Applications are built with front-end and back-end programming languages and their necessary library dependencies. Operating systems and programming languages can be periodically updated to the latest versions, but what about the many libraries used in the app's front end and back end? As we all know, it can be quite a daunting task to maintain and individually update a long list of software dependencies like the examples later in this post. Still, it is important to keep them updated. This post dives into our experience upgrading a complex app with a full software stack and lots of dependencies. We'll examine the benefits of upgrading, what you will need, and how to go about such an upgrade as simply as possible.

The app in question contained decade-old software and an extensive list of libraries when we received it from our client. It used languages including Java, Scala, Kotlin, and JavaScript along with many libraries. The initial plan was to upgrade the complete software stack and libraries all at once due to the gap between versions. This proved to be more difficult than expected due to a host of deprecated and removed functionality as well as the interdependence of a few of the libraries.

Conflict approach: “Don’t update unless you have to”

While this can be sustainable in the short term, it quickly becomes less tenable in the long run. One important purpose of updates is to (hopefully) protect against new vulnerabilities and cyberattacks. Scenarios arise where a fix for a particular library is only available in its latest version, which in turn requires upgrading other libraries in a chain. Because upgraded libraries need extensive testing and preparation for new issues, this directly affects whether the app can adopt the fix at all. Therefore, smaller and more frequent updates are more sustainable in the long run. Larger and more infrequent upgrades will not only result in unexpected errors, but also require regression testing to deliver a bug-free update. The following reasons justify the necessity of software and library upgrades:

  • They apply timely security patches to reduce vulnerabilities and defend from cyberattack
  • You can identify deprecated functionality earlier and use alternatives
  • They can apply fixes for known bugs in the library
  • They’re based on the latest versions of software or library, encouraging stability

In addition, staying on the latest major version yields the benefit of applying minor and patch releases seamlessly without risk. Most software and libraries use semantic versioning, formatted as MAJOR.MINOR.PATCH (examples below).

  • MAJOR version - incompatible API changes
  • MINOR version - added functionality that is backward compatible
  • PATCH version - backward-compatible bug fixes

So being on the latest major version of each library makes it possible to apply minor and patch releases without breaking the app's existing functionality.

Benefits

The benefits of keeping software and libraries updated include bug fixes, new features, boosted performance, as well as better stability, compatibility, and security measures. We can often ignore upgrades in projects because we don’t perceive significant effects in appearance or usage. But on closer inspection, frequent updates deliver advantages which clearly demonstrate the importance of upgrading software libraries.

Real-time difficulties

In real-life scenarios, bug fixes and performance optimization of an app often take higher priority than upgrading libraries. In many cases, this results in a growing discrepancy between the versions the app runs and the current releases. This discrepancy can lead to the following issues when upgrading libraries:

  • Unexpected error-handling changes in new releases
  • Unexpected errors from unsupported library dependencies
  • The need for complete end-to-end testing due to major version upgrades
  • A lack of visibly demonstrable work, which can lead to a lack of confidence from the client
  • Difficulty estimating the workload due to major unexpected errors

The following passages offer guidance on how to adequately prepare for these issues.

Get prepared

Much can be learned from a major upgrade on both a business and a technical level, so being prepared for such an endeavor is imperative. First, compile a complete list of libraries used across the entire app. This list should include both the version currently running in the app and the latest available version of each library. Reviewing the changelogs between those versions will help identify potential incompatibilities or problems. Visiting the “issue report page” of the library to monitor any version-specific issues is also recommended. Once you've adequately prepared the libraries for upgrading, you can use any method of your choosing to upgrade and maintain the list. Once the roadmap is established, upgrading can commence and compatibility issues that arise can be dealt with in real time. Finally, thorough end-to-end testing is necessary once the upgrade process is complete.

Below are some examples of software lists you might have for a project (we left out the latest versions for brevity in this post, but it’s helpful to have a column for those too!).

Programming software & servers

Software                        Running            Release date
Ubuntu                          12.04.5            2017-04-28
nginx                           1.12.2             2017-10-17
Jetty Java Servlet container    9.2.3              2014-09-05
JVM/Java                        1.7u79             2015-04-14
Scala                           2.11               2014-04-21
Kotlin                          0.9.223            2014-10-23
Lucene                          4.10.0             2014-09-03
PostgreSQL

Java libraries

Software                        Running            Release date
joda-convert                    1.3.1              2013-03-01
joda-time                       2.2                2013-03-01
play-json                       2.7.0              2019-01
httpclient                      4.1.2              2011-07-10
httpcore                        4.1.2              2011-07-22
httpmime                        4.1.2              2011-07-29
commons-codec                   1.3                2005-10-01
jooq                            3.1.0              2013-06-30
jooq-codegen                    3.1.0              2013-06-30
jooq-meta                       3.1.0              2013-06-30
jtds                            1.3.0              2012-10-27
servlet-api                     3.1                2013-04-25
play-json-joda                  2.11-2.6.9         2018-03-01
lucene-core                     4.10.0             2014-09-02
lucene-memory                   4.10.0             2014-09-02
lucene-queries                  4.10.0             2014-09-02
lucene-queryparser              4.10.0             2014-09-02
lucene-highilghter              4.10.0             2014-09-02
lucene-analyzers-common         4.10.0             2014-09-02
postgresql                      9.2-1003-jdbc4     2013-05-27
jsoup                           1.7.3              2013-11-11
jetty-util                      9.2.5.v20141112    2014-11-12
tika-core                       1.16               2017-07-07
commons-fileupload              1.3                2013-03-24
commons-beanutils               1.6                2005-11-08
commons-collections             2.1                2005-11-08
commons-digester                2.1                2010-09-24
commons-io                      1.1                2005-11-24
commons-lang                    1.0.1              2005-11-24
commons-logging                 1.0.3              2005-11-17
commons-validator               1.0.2              2005-11-24
slf4j-api                       1.6.4              2011-10-31
slf4j-nop                       1.6.4              2011-10-31

JavaScript libraries

Software                        Running            Release date
angular                         1.2.11             2014-02-03
jQuery                          1.10.2             2013-07-03
calendar                        1.51               2005-03-07
calendar-setup                  1.25               2005-03-07
fckeditor                       2.6.11             2014-06-02
dom-drag                                           2001-10-28
jt_DialogBox                    29                 2012-08
jt_AppDialogs                   9                  2005-05
ng-grid                         2.0.7              2013-06-07
ng-grid-flexible-height         2.0.7              2013-06-07
ng-grid.css                     2.0.7              2013-06-07

Recommendations

Once the app's software stack and libraries are up to date, it's important to keep updating them regularly. A complete upgrade can be a rigorous and involved process that includes juggling numerous issues and intensive testing. From a client's perspective, a major upgrade project often doesn't obviously demonstrate improvement in appearance or function. Because keeping outdated software is not recommended, and major upgrades (such as the one described above) are tedious, the following update strategies are preferable in order to keep software libraries as up to date as possible:

  • Interval based: Check the versions of software and libraries at regular intervals. This minimizes the frequency of errors while ensuring a minimal gap between the current and latest versions.
  • Module based: Whenever work is performed on a module for new features or bug fixes, review the software and library versions it uses. This allows the relevant libraries to be updated, tested, and deployed along with the development changes within the module.

Conclusion

Even if the new features and improvements in a release are not pertinent to your app, frequent upgrades to software and libraries are crucial to ensure you are running the most secure and most debugged versions. With security as the highest priority, these upgrades are not only imperative, but unavoidable. Please feel free to share your methods of upgrading software libraries in a comment!

Jamstack Conf Virtual 2020: Thoughts & Highlights


Conference

Welcome to Jamstack Conf Virtual 2020

Last week I attended Jamstack Conf Virtual 2020. It had originally been slated to take place in London, UK but was later transformed into a virtual event in light of the COVID-19 pandemic. The conference began at 2pm London time (thankfully I double-checked this the night before!)—​6am for those of us in the Pacific Time Zone.

Before getting too much further I wanted to mention that if you are not familiar with the Jamstack, you can read more about it at jamstack.org.

To participate in the conference virtually we used an app called Hopin. I had not heard of it before but was impressed with how well it worked. At one point when I checked, there were over 3000 attendees from 130+ countries. Phil Hawksworth was the Host/MC for the event and did a great job. There were virtual spaces for the stage, sessions, expo (vendors), and networking. If you opted in, the networking feature paired you with a random attendee for a video chat. I'm not sure what I expected going into it, but I thought it was fun; I met a fellow developer from the Dominican Republic. The experience was very similar to the hallway track or lunch line at an in-person conference, though more serendipitous.

Phil Hawksworth welcoming the attendees

Keynote

Matt Biilmann opened the conference with a keynote address about the challenges we face as a developer community trying to improve access to accurate, timely, and locally relevant information for a global audience: many billions of users with all kinds of devices and varying levels of connectivity. He moved on to share how Netlify is trying to enable developers to “build websites instead of infrastructure” and “ensure all the best practices become common practices” through features like git-based deployments, build plugins, and edge handlers (more on those later).

State of the Jamstack Survey results

Laurie Voss reporting findings from the Jamstack Survey 2020

Laurie Voss walked us through the results of the Jamstack Survey 2020. There were some interesting findings and surprises. Later on I read Laurie's post (which Matt had mentioned in his talk) and found that very interesting as well.

Fireside chat with Harper Reed

Frances Berriman interviewed Harper Reed - fireside chat style

Frances Berriman chatted with Harper Reed and asked him about his application of Jamstack principles from years ago when he led the technology team for Barack Obama’s election campaign. He described the need to “get far with very limited resources” and spoke about experimenting with serving HTML from Google App Engine and Amazon S3. Using pre-built HTML allowed them to scale very efficiently and he opted to use message queues rather than interacting with the database to keep things very quick for users.

Another benefit Harper noted was how quickly and easily new team members could be onboarded. Folks who knew HTML, CSS and JavaScript would be up to speed and productive in no time. He admitted it’s more complicated today 😀. Speed is another benefit—he just loves when he comes across a super-fast web site (often mostly static).

Lightning Launch: Netlify ⚡

David Calavera gave a demo of Netlify’s new Edge Handlers feature which lets developers add logic to their code at the edge (e.g. the CDN servers geographically closest to the user). He demonstrated how Edge Handlers make it possible to examine the headers of the request to tailor the response to that unique request. Check out the video of his talk to watch him live code an example. I believe Cloudflare Workers and Fastly’s Edge Compute are operating in a similar problem space. I plan to look into each of these offerings more thoroughly in future.

Lightning Launch: Prismic ⚡

Renaud Bressand from Prismic demoed Slicemachine—​a new feature from Prismic which lets you combine nuxt/​Vue components with content managed in Prismic. It looked like a very compelling way of enabling better collaboration between developers and content creators.

Lightning Launch: RedwoodJS ⚡

Tom Preston-Werner demoed his latest project RedwoodJS. I had heard Tom speak about this on a few podcasts recently and it was nice to see him demo it for us. It looks interesting and feels reminiscent of Rails. RedwoodJS simplifies wiring up your database to a GraphQL API (they use Prisma for this) and integrating it into a React application. Tom is also the creator of Jekyll—​a Jamstack-style tool which has been around for many years. It was nice to see several speakers give him some recognition for his work on that project.

The COVID Tracking Project: 0 to 2M API requests in 3 months

Erin Kissane presenting The COVID Tracking Project

Erin Kissane gave an inspiring talk about her work on The COVID Tracking Project. She described it as an “interim public data rescue project”. Erin and some friends, journalists and volunteers worked together to create the site. They started by scraping COVID data from each state’s web site and storing it in a Google Sheet. They used Gatsby, Contentful, and Sass modules and hosted the site on Netlify. Using the Jamstack approach allowed the site to remain performant and continue to be responsive under some huge traffic spikes. Over time, they iterated on the design and continue to improve the site daily. I highly recommend checking out the video of the talk when you get a chance.

Jamstack for emerging markets

Christian Nwamba described some of the challenges of building sites for users in Nigeria (low power devices, spotty connectivity, unreliable power). He shared that 55% of the most visited sites in Nigeria are global companies (Google, Facebook/​Instagram, Netflix, Stack Overflow etc.). Christian reviewed a large, popular banking site in Nigeria and noted its many shortcomings.

To demonstrate how the bank might do better he built an app for transferring money between friends & family built in the Jamstack style and using serverless functions. The most interesting thing I picked up from this was his method of using serverless functions to protect the app secrets (API keys, etc.). The front end of the application did not need to concern itself with this—​the serverless functions kept the secrets safe and acted as a proxy between the frontend and the backend APIs. Be sure to take a look at Christian’s code if you are interested.

Managing diabetes with Jamstack

Jamie Bradley taught us about diabetes and the Jamstack app he built to help himself and others manage it. He built HeySugar with Gatsby and Sanity, hosted it on Netlify. He’s making it easier for others to deploy their own instances as well.

Selling tickets without servers. Or frameworks. Or cookies.

Jan van Hellemond has volunteered for years at a very popular conference in Europe (Fronteers, I think). In previous years tickets sold out quickly: in 6 minutes one year, and in 30 seconds another! Their ticket vendor was struggling to handle the load, and this caused them to oversell early bird tickets. Jan built a Jamstack site to sell their tickets and was very pleased with the performance. He used simple, single-purpose vanilla JavaScript components and simple, single-purpose API handlers (serverless functions).

Jan presenting his Jamstack ticket selling app

Jan prerendered as much as possible and seeded a relational database with the tickets available for sale. As the tickets were sold, they were marked sold with a database update. Webhooks were used as customers stepped through the checkout flow. Jan joked about deploying to production on a Friday afternoon because of how safe and simple the deployment process was. There were no services to restart, etc. because “it’s just files”. It was cool to see a practical example and that Jan used the basic building blocks of the web (HTML, JavaScript, CSS, old school links) successfully without reaching for a large JavaScript framework.

The business side of the Jamstack

Ana Rossetto shared how her company has been having great success with the Jamstack. Previously, the agency had been building projects for clients with Drupal. She walked through a project she and the team built to encourage people to buy from small, locally-owned businesses. She was impressed with what they were able to create in a very short amount of time.

Build plugin authors’ session

After the main talks there were several sessions. In the Hopin app, I was able to peek into some of these and the presenters and attendees were chatting (both text chat and video). This was very much like the experience of peeking into conference rooms at a physical venue and choosing whether to stay and participate or move on. After some wandering I chose to attend a session with three Netlify build plugin authors.

Peter telling us about subfont

Peter Müller built Subfont with a friend. He demonstrated how subsetting web fonts (i.e. only loading the characters & glyphs you really need) can dramatically improve frontend performance. He compared the WebPageTest results for Google Fonts with and without Subfont and Subfont was seconds faster! I have subset and locally hosted webfonts in several client projects here at End Point. It takes time and is a manual process. Peter’s plugin makes this excellent performance improvement relatively painless.

David Darnes demoed his build plugin to turn Ghost content into Markdown files. Very interesting and it seemed flexible enough to work with other tools as well.

Gleb Bahmutov presented the build plugin he created to run Cypress tests as part of your Netlify build process. It was amazing to see how simple it was (single devDependency and single line in the netlify.toml config file).

Videos from the conference

Netlify has already put all of the Jamstack Conf Virtual 2020 talks on YouTube so head over there and check those out if you’d like to. Thanks very much to the team at Netlify for organizing the conf, all of the speakers, vendors and attendees!

Linux Development in Windows 10 with Docker and WSL 2


Banner

I’m first and foremost a Windows guy. But for a few years now, moving away from working mostly with .NET and into a plethora of open source technologies has given me the opportunity to change platforms and run a Linux-based system as my daily driver. Ubuntu, which I honestly love for work, has been serving me well by supporting my development workflow with languages like PHP, JavaScript and Ruby. And with the help of the excellent Visual Studio Code editor, I’ve never looked back. There’s always been an inclination in the back of my mind though, to take some time and give Windows another shot.

With the latest improvements coming to the Windows Subsystem for Linux with its second version, the new and exciting Windows Terminal, and Docker support for running containers inside WSL2, I think the time is now.

In this post, we’ll walk through the steps I took to set up a PHP development environment in Windows, running in a Ubuntu Docker container running on WSL 2, and VS Code. Let’s go.

Note: You have to be on the latest version of Windows 10 Pro (Version 2004) in order to install WSL 2 by the usual methods. If not, you’d need to be part of the Windows Insider Program to have access to the software.

What’s new with WSL 2

This is best explained by the official documentation. However, being a WSL 1 veteran, I'll mention a few of the improvements that have sparked my interest in trying it again.

1. It’s faster and more compatible

WSL 2 introduces a complete architectural overhaul. Now, Windows ships with a full Linux Kernel which WSL 2 distributions run on. This results in greatly improved file system performance and much better compatibility with Linux programs. It’s no longer running a Linux look-alike, but actual Linux.

2. It’s better integrated with Windows

This is a small one: we can now use Windows Explorer to browse files within a WSL distribution. This is not a WSL 2 exclusive feature; it has been there for a while now. I think it's worth mentioning though, because it truly is a great convenience and a far cry from WSL's first release, where Microsoft specifically advised against manipulating WSL distribution file systems from Windows. If nothing else, this makes WSL feel like a first class citizen in the Windows ecosystem and shows that Microsoft actually cares about making it a good experience.

3. It can run Docker

I've recently been learning more and more about Docker and it's quickly becoming my preferred way of setting up development environments. Due to its light weight, ease of use, repeatability, and VM-like compartmentalization, I find it really convenient to develop against a purpose-built Docker container rather than directly on my local machine. And with VS Code's Remote Development extension, the whole thing is very easy to set up. Docker for Windows now supports running containers within WSL, so I'm eager to try that out and see how it all works.

4. A newer version means several bugfixes

Performance notwithstanding, WSL's first release was pretty stable. I did, however, encounter some weird bugs and gotchas when working with the likes of SSH and Ruby during certain tasks. Nothing major, as workarounds were readily available, and I've already discussed some of them here, so I won't repeat them. But given that the technology has matured since I last used it, and considering the architectural direction it is going in, I'm hopeful I won't have to deal with as many quirks.

The development environment

Ok, now with some of the motivation out of the way, let’s try and build a quick PHP Hello World app running in a Docker container inside WSL 2, make sure we can edit and debug it with VS Code, and access it in a browser from Windows.

Step 1: Install WSL 2 and Ubuntu

Step 1 is obviously to install WSL and a Linux distribution that we like. Microsoft’s own documentation offers an excellent guide on how to do just that. But in summary, we need to:

  1. Enable the “Windows Subsystem for Linux” and “Virtual Machine Platform” features by running these commands in an elevated PowerShell:
     dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
     dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
  2. Restart your machine.
  3. Set WSL 2 as the default version with: wsl --set-default-version 2, also from PowerShell.
  4. Install your desired distribution from the Microsoft Store. I chose Ubuntu 20.04 LTS.
  5. After installing, open the “Ubuntu 20.04 LTS” app from the Start menu and it should come up with a command line console. Wait for it to finish installing. It should prompt for a username and password along the way. Choose something you won’t forget.

Optionally, you can install the Windows Terminal app to get a better command line experience. Windows Terminal can be used to interact with PowerShell and the classic CMD, as well as with our WSL distributions.

Step 2: Install Docker

Installing Docker is very straightforward. Just download the installer for Docker Desktop for Windows, execute it, and follow the wizard’s steps. Make sure that during setup the “Use the WSL 2 based engine” option is selected. In most cases, the installer will detect WSL 2 and automatically have the option selected.

Follow the official instructions for more details on the process, but it really is that simple.

Step 3: Install some useful VS Code extensions

Our objective is to create a new development environment inside a Docker container and connect to it directly with VS Code. To do that, we use a few useful extensions:

  1. The Docker extension which allows us to browse and manage images and containers and other types of Docker assets.
  2. The Remote - WSL extension which allows VS Code to connect to a WSL distribution.
  3. The Remote - Containers extension which allows VS Code to connect to a container.

Step 4: Create the development container

The extensions that we installed will allow us to use VS Code to work on code from within our WSL Ubuntu as well as from the container. Specifically, we want to connect VS Code to a container. There are a few ways to do this, but I will describe the one I think is the easiest, most convenient and “automagic” by fully leveraging the tools.

Let’s begin by opening a WSL Ubuntu terminal session, which will show something like this:

Welcome to Ubuntu 20.04 LTS (GNU/Linux 4.19.104-microsoft-standard x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

...

kevin@kevin-thinkpad:/mnt/c/Users/kevin$

The project directory

Let’s change to our home, create a new directory for our new project, and change into it.

$ cd
$ mkdir php-in-docker-demo
$ cd php-in-docker-demo

Because we installed the Remote - WSL extension, we can open up this directory in VS Code by running code . from the WSL terminal. Opening a terminal (Ctrl + `) in this VS Code instance opens a WSL console, not a Windows one.

The Dockerfile

Now let’s create a new file called Dockerfile which will define what our development environment image will look like. For a no-frills PHP environment, mine looks like this:

# The FROM statement says that our image will be based on the official Ubuntu Docker image from Docker Hub: https://hub.docker.com/_/ubuntu
FROM ubuntu

# The RUN statement executes the command that follows it inside the container
# These install PHP and its prerequisite
RUN apt-get update && apt-get install -y software-properties-common
RUN apt-get update && apt-get install -y php

# These ones install Xdebug and configure it so that the VS Code debugger can use it.
RUN apt-get update && apt-get install -y php-xdebug
RUN echo "xdebug.remote_enable=on" >> /etc/php/7.4/mods-available/xdebug.ini
RUN echo "xdebug.remote_autostart=on" >> /etc/php/7.4/mods-available/xdebug.ini

# This installs Composer
RUN apt-get update && apt-get install -y composer

# The CMD statement tells Docker which command to run when it starts up the container.
# Here, we just call bash
CMD ["bash"]

This Dockerfile will later be used to build our development container. The container will have PHP, Xdebug, and Composer, which is all we need for our simple Hello World app. For more complex scenarios, other software like database clients or PHP extensions can easily be installed with additional RUN statements that call upon the apt package manager.
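
For instance, a PostgreSQL client and the matching PHP extension might be added with something like this (a hypothetical addition, not part of the original setup; package names are the usual Ubuntu ones, adjust as needed):

# Hypothetical additions: a database client and a PHP database extension
RUN apt-get update && apt-get install -y postgresql-client php-pgsql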

Consider reading through Docker’s official documentation on Dockerfiles to learn more.

The configuration file

Now, to leverage VS Code’s capabilities, let’s add a development container configuration file. In our current location, we need to create a new directory called .devcontainer and, inside that, a new file called devcontainer.json. I put these contents in mine:

{
    // The name used by VS Code to identify this development environment
    "name": "PHP in Docker Demo",

    // Sets the run context to one level up instead of the .devcontainer folder.
    "context": "..",

    // Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
    "dockerFile": "../Dockerfile",

    // Add the IDs of extensions you want installed when the container is created.
    // This is the VS Code PHP Debug extension.
    // It needs to be installed in the container for us to have access to it.
    "extensions": [
        "felixfbecker.php-debug"
    ],

    // Use 'forwardPorts' to make a list of ports inside the container available locally.
    // When we run our PHP app, we will use this port.
    "forwardPorts": [5000],
}

A default version of this file can be automatically generated by running the “Remote-Containers: Add Development Container Configuration Files…” command in VS Code’s Command Palette (Ctrl + Shift + P).

The development container

Now that we have all that in place, we can create our image, run our container, and start coding our app. Bring up the VS Code Command Palette with Ctrl + Shift + P and run the “Remote-Containers: Reopen in Container” command. The command will:

  1. Read the Dockerfile and create an image based on that. This is like running docker build -t AUTOGENERATED_IMAGE_ID .
  2. Run a container based on that image with the settings specified in .devcontainer/devcontainer.json. In our case, all it will do is enable the container’s port 5000 to be accessible by the host. This is more or less like running: docker run -d -p 5000:5000 -v ${PWD}:/workspaces/php-in-docker-demo AUTOGENERATED_IMAGE_ID
  3. Open a new VS Code instance connected to the container with the /workspaces/php-in-docker-demo directory open.

It will take a while, but after it’s done, we will have a VS Code instance running directly in the container. Open the VS Code terminal with Ctrl + ` and see for yourself. It will show a prompt looking like this:

root@ec5be7dd0b9b:/workspaces/php-in-docker-demo#

You can, for example, run php -v in this terminal and expect something along these lines:

PHP 7.4.3 (cli) (built: May 26 2020 12:24:22) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies

This is PHP running, not in Windows, not in our WSL Ubuntu, but in the Docker container.

Hello Windows + WSL 2 + Ubuntu + Docker + PHP + VS Code

Let’s now create our app. Add a new index.php file containing something silly like:

<?php

echo "Hello Windows + WSL 2 + Ubuntu + Docker + PHP + Visual Studio Code!";

Then, in the VS Code console (remember, Ctrl + `), start up an instance of the built-in PHP development server with php -S 0.0.0.0:5000. It's important that we use port 5000 because that's the one we configured our container to expose.

Navigate to http://localhost:5000/ in your browser and feel good about a job well done.

Running app

Interactive debugging

When configuring our development container, we added Xdebug and the PHP Debug VS Code extension. This means that VS Code can leverage Xdebug to provide an interactive debugging experience for PHP code.

Almost everything is set up at this point; we just need to do the usual VS Code configuration and add a launch.json file. To do so, in VS Code, press Ctrl + Shift + D to bring up the “Run” panel, click on the “create a launch.json file” link, and in the resulting “Select Environment” menu, select “PHP”.

Running app

After that, the “Run” panel will show a green triangular “Start Debugging” button next to a “Listen to XDebug” text. If you haven’t already, start up a dev web server with php -S 0.0.0.0:5000, click on the “Start Debugging” button, put a breakpoint somewhere in your index.php file, and finally open up http://localhost:5000/ in a browser.

Running app

We're interactively debugging PHP code running in a Docker container in WSL from our Windows IDE/editor. Pretty cool, huh?

And that's all for now. In this article we've learned how to set up a Linux development environment using Docker containers and WSL 2 on Windows 10 Pro. This is a nice approach for anybody who's comfortable on Windows, needs access to a Linux environment for development, and wants that environment to be easy to reproduce.


Magento 2: Creating a custom theme


blue and yellow paint from a tube on a canvas

Photo by Maria Eklind, CC BY-SA 2.0

In my previous post, we went through the steps needed to create a custom module in Magento 2. While modules consist of a set of classes to add new features to Magento, a theme controls how these features, and the entire website in general, will be displayed to the user. As stated in the Magento guide, a theme uses a combination of custom templates, layouts, styles, and images to provide a consistent look and feel across a Magento store.

Creating a new Magento 2 theme

We can create a theme based on a default “parent” theme or create a standalone theme from scratch. In most cases, I would recommend the first option. For this example, we will use Luma as our parent theme. The other option would be inheriting from the default “blank” theme.

Here’s an initial task list to get our new theme ready:

  • Create a new directory for the theme
  • Create the registration.php script
  • Create the theme.xml information file
  • Activate the new theme

Creating a new directory for the theme

While all our backend code should go in app/code, the frontend content is expected to go in app/design. And as our theme will only apply design changes to the frontend content, we should create the new directory for it under the path app/design/frontend. If we want to create a theme for the admin area instead, we need to create the directory inside app/design/adminhtml.

Let’s create a directory named EndPoint (our vendor name, continuing with the example from our previous article) and a subdirectory inside it, MyTheme:

cd {website_root}
mkdir -p app/design/frontend/EndPoint/MyTheme

Creating registration.php

Similar to the file we created for our module, registration.php tells Magento to register the new theme with the name and location we specify. Our file will be located at app/design/frontend/EndPoint/MyTheme/registration.php and should have the following content:

<?php
\Magento\Framework\Component\ComponentRegistrar::register(
    \Magento\Framework\Component\ComponentRegistrar::THEME,
    'frontend/EndPoint/MyTheme',
    __DIR__
);

This way, Magento will know what path our theme will have.

Creating theme.xml

The next step is to create our theme information file, where we will specify the theme name and parent theme. So our app/design/frontend/EndPoint/MyTheme/theme.xml file should have the following content:

<?xml version="1.0"?>
<theme xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:Config/etc/theme.xsd">
    <title>MyTheme</title>
    <parent>Magento/luma</parent>
</theme>

Optional: If we want our theme to be easily distributed as a package, we can also create a composer.json file in the theme’s root directory. The content for the file should be as follows, specifying the theme’s description, dependencies, version, and license types:

{
    "name": "endpoint/mytheme",
    "description": "My Theme by End Point",
    "require": {
        "magento/theme-frontend-luma": "100.0.*",
        "magento/framework": "100.0.*"
    },
    "type": "magento2-theme",
    "version": "100.0.1",
    "license": [
        "OSL-3.0",
        "AFL-3.0"
    ],
    "autoload": {
        "files": [
            "registration.php"
        ]
    }
}

Activating our new theme

That was easy! We have everything we need to activate our new theme. To do that we log in to our admin area and enable our theme. Once in the dashboard, we need to go to Content > Design > Configuration, edit our store view, and select our new theme from the dropdown list:

Selecting our theme

Magento will search for new themes every time we log in to the admin area, so our new theme will appear on the list automatically.
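If the new theme doesn’t show up in the dropdown, one thing worth trying (assuming you have shell access to the website root) is forcing Magento to re-register components and then flushing the cache:

php bin/magento setup:upgrade
php bin/magento cache:flush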

Adding custom content

We have the basic structure for our theme, but when enabled, it will look the same as its parent theme (Luma, in this case), since we didn’t add any design rules or static files yet. Let’s do some more things with our theme to change how it’s displayed in the frontend:

  • Create a custom etc/view.xml file
  • Add a custom logo
  • Add static files (JavaScript, CSS, images, fonts)
  • Add a custom layout

Creating a custom view file

etc/view.xml controls many frontend configurations like the product thumbnail width, how the product image gallery is displayed, and the image magnifier tool, among other things. To add our custom view file to our theme, we need to copy the existing file from our parent theme. For Luma, it will be located at vendor/magento/theme-frontend-luma/etc/view.xml. To copy the file, we need to run the following in our website’s root:

mkdir -p app/design/frontend/EndPoint/MyTheme/etc
cp vendor/magento/theme-frontend-luma/etc/view.xml app/design/frontend/EndPoint/MyTheme/etc/view.xml

And then we can use our preferred text editor to change the values we want, like setting a custom size for the images in the category page grid:

<image id="category_page_grid" type="small_image">
    <width>300</width>
    <height>300</height>
</image>
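For context, that <image> element is not at the top level of view.xml; in the copied file it sits inside a structure roughly like the one below, so we only need to edit the values we care about:

<view xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:Config/etc/view.xsd">
    <media>
        <images module="Magento_Catalog">
            <image id="category_page_grid" type="small_image">
                <width>300</width>
                <height>300</height>
            </image>
        </images>
    </media>
</view>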

Adding a custom logo

Adding a logo to our theme is really simple. We just need to save our picture in SVG format as web/images/logo.svg. If we want to use a different filename or format for our logo, we will have to create a default layout file for our theme at Magento_Theme/layout/default.xml inside our theme root, with content similar to this:

<page xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:View/Layout/etc/page_configuration.xsd">
  <body>
    <referenceBlock name="logo">
      <arguments>
        <argument name="logo_file" xsi:type="string">images/custom_logo.png</argument>
        <argument name="logo_width" xsi:type="number">300</argument>
        <argument name="logo_height" xsi:type="number">200</argument>
        <argument name="logo_alt" xsi:type="string">Custom logo name</argument>
      </arguments>
    </referenceBlock>
  </body>
</page>

We can use different image formats such as SVG, PNG, or JPG. We can also use a custom width and height for the logo, and set a custom alternate text.

Adding static files (JavaScript/​CSS/​images/​fonts)

All the static files should be located inside the web directory. Common static files include JavaScript files, stylesheets, images, and fonts. The JavaScript files should be located at web/js, stylesheets at web/css, images at web/images, and our custom fonts should be located at web/fonts.
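If those directories don’t exist yet, we can create them all in one go from the website root (purely a convenience; Magento only needs the ones we actually use):

mkdir -p app/design/frontend/EndPoint/MyTheme/web/{css,js,images,fonts}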

All the static files will be published as direct links, without any processing from Magento, at the pub/static/frontend/EndPoint/MyTheme/en_US path. The default locale/​language is en_US; we can change it for our theme if needed.

Adding a custom layout

Finally, if we want to use the new assets we added and have custom content on different sections of our website, we need to extend or override the existing layout files from our parent theme.

For example, if we want to add a reference to new stylesheet and JavaScript files we added, we need to extend the existing header layout from our parent theme. To do this, we will create a new layout file located at Magento_Theme/layout/default_head_blocks.xml:

<page xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:View/Layout/etc/page_configuration.xsd">
  <head>
    <!-- Custom stylesheet -->
    <css src="css/mytheme.css"/>

    <!-- Custom JavaScript -->
    <script src="js/mytheme.js"/>
  </head>
</page>

This way we will be adding a reference to a new stylesheet named mytheme.css that we have inside the web/css directory of our theme, and a new script named mytheme.js inside the web/js directory.
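The contents of those files are entirely up to us. As a quick sanity check that the theme is really serving them, a trivial placeholder rule in web/css/mytheme.css (nothing Luma-specific, just something visible) is enough:

body {
    background-color: #fafafa;
}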

After we make all the desired changes to our theme, we will need to tell Magento to update the frontend. We need to deploy the new changes and clear the cache. We can achieve that by running this from our website root:

php bin/magento setup:static-content:deploy
php bin/magento cache:clean

This process can take a few minutes to complete. After it’s done, we can go to our website to see if the frontend changes are applied:

Website homepage

Looks awesome! Of course, there’s a lot more we can do from here, from extending or overriding layouts from modules (Magento or third-party) to bundling scripts or using Less files for our custom styles. But that is material for later posts, so that’s all for now! Please add any questions you might have below.
