Image Data, OCR, and SVG

Working with image data is far more complex than it looks. For a long time, I hesitated to even touch it. In most cases, the real question is not “How do we extract information from an image?” but “How do we avoid needing the image in the first place?” The latter is almost always the better path. Still, there are times when avoiding the image is impossible.

At first, pulling information from an image seems simple. Our eyes pick out words and shapes with ease. Why can’t software do the same?

A recent project forced my hand. At my job, I wanted to automate the repetitive task of converting client label mockups into reusable YAML and SVG template files. These templates let us populate designs with dynamic information. Anyone who has written SVG by hand knows how painful the process can be. For those who have not, imagine building a basic web page in early-2000s HTML with no framework or tooling. It is not enjoyable.

Automation was the obvious solution, but client specifications require accuracy down to fine visual details. When researching this problem, I looked at how receipt-scanning apps work. These apps take an image and extract useful information. Some use the data for documentation. Others harvest it to sell to data farms. Either way, they rely on Optical Character Recognition (OCR).

For my project, I chose Tesseract.js, a JavaScript library built on the Tesseract OCR engine. It runs in both Node.js and the browser and makes OCR as close to drag-and-drop as possible.

Unlike text files, images contain no semantic information or useful metadata about their content. Every detail is stored as colored pixels with no knowledge of words or meaning. To extract structure from this chaos, the OCR software must:

  1. Detect shapes that look like characters.
  2. Group them into words, lines, and paragraphs.
  3. Estimate bounding boxes around each element.
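
To make step 2 a little more concrete, here is a toy sketch of grouping already-detected boxes into lines. It is not how Tesseract does it internally. The `Box` type, the `groupIntoLines` name, and the "vertical center falls inside the first box of the line" rule are all assumptions chosen for illustration.

```typescript
// Hypothetical box shape: image coordinates, y grows downward.
interface Box {
  text: string;
  x0: number; // left
  y0: number; // top
  x1: number; // right
  y1: number; // bottom
}

// Group boxes into lines: a box joins the current line when its vertical
// center falls between the top and bottom of that line's first box.
function groupIntoLines(boxes: Box[]): Box[][] {
  const sorted = [...boxes].sort((a, b) => a.y0 - b.y0);
  const lines: Box[][] = [];

  for (const box of sorted) {
    const centerY = (box.y0 + box.y1) / 2;
    const current = lines[lines.length - 1];
    const first = current?.[0];

    if (current && first && centerY >= first.y0 && centerY <= first.y1) {
      current.push(box);
    } else {
      lines.push([box]);
    }
  }

  // Within each line, restore left-to-right reading order.
  return lines.map((line) => line.sort((a, b) => a.x0 - b.x0));
}
```

Even this simple rule shows why the work is fragile: a slightly rotated scan or a tall glyph is enough to push a word onto the wrong line.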

This is deeply fragile work. OCR is prone to errors when images are noisy or spacing is inconsistent. Even when text is recognized, a character may be misread or a box drawn slightly off, so the output cannot be trusted blindly.

Tesseract.js makes it relatively simple to run OCR in JavaScript:

const { data } = await worker.recognize(imageBuffer, {}, { blocks: true });

The result is a tree of blocks, paragraphs, lines, and words. Each node comes with a bounding box. The problem is that these boxes are often messy. Words can be grouped incorrectly. Boxes may overlap or leave gaps. Confidence scores are often too low to rely on.
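
A minimal sketch of how those two problems might be flagged before going further. The `Recognized` shape, the helper names, and the cutoff of 60 are assumptions for illustration. It also assumes confidence is reported on a 0 to 100 scale, so check what your OCR output actually returns.

```typescript
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Hypothetical minimal shape of one recognized word.
interface Recognized {
  text: string;
  confidence: number; // assumed 0..100
  bbox: BBox;
}

// Hypothetical cutoff; tune it against real samples.
const MIN_CONFIDENCE = 60;

// Keep only non-empty results at or above the confidence cutoff.
function isReliable(item: Recognized, min: number = MIN_CONFIDENCE): boolean {
  return item.text.trim().length > 0 && item.confidence >= min;
}

// True when two boxes share interior area; boxes that only touch at an
// edge do not count as overlapping.
function overlaps(a: BBox, b: BBox): boolean {
  return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Return every pair of items whose boxes overlap (O(n^2), fine for a label).
function findOverlaps(items: Recognized[]): Array<[Recognized, Recognized]> {
  const pairs: Array<[Recognized, Recognized]> = [];
  items.forEach((a, i) => {
    for (const b of items.slice(i + 1)) {
      if (overlaps(a.bbox, b.bbox)) pairs.push([a, b]);
    }
  });
  return pairs;
}
```

Flagged items can then be reviewed by hand instead of silently ending up in a template.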

To handle this, I wrote my own WordBox objects that store text, a bounding box, and a confidence value. I walked the nested structure from blocks to paragraphs to lines, and from there into words.
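
The walk described above could look roughly like this. The `Ocr*` interfaces are a trimmed-down assumption of the nested result (only the fields used here), and `collectWordBoxes` is a hypothetical name, not the original implementation.

```typescript
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Assumed minimal shapes for the nested OCR result.
interface OcrWord { text: string; confidence: number; bbox: BBox; }
interface OcrLine { words: OcrWord[]; }
interface OcrParagraph { lines: OcrLine[]; }
interface OcrBlock { paragraphs: OcrParagraph[]; }

// Flat record kept for each recognized word.
interface WordBox {
  text: string;
  bbox: BBox;
  confidence: number;
}

// Walk blocks -> paragraphs -> lines -> words and flatten into WordBox[].
// A missing block list is treated as "no words" rather than an error.
function collectWordBoxes(blocks: OcrBlock[] | null | undefined): WordBox[] {
  const boxes: WordBox[] = [];
  for (const block of blocks ?? []) {
    for (const paragraph of block.paragraphs) {
      for (const line of paragraph.lines) {
        for (const word of line.words) {
          boxes.push({
            text: word.text,
            bbox: { ...word.bbox }, // copy so later edits do not touch the OCR result
            confidence: word.confidence,
          });
        }
      }
    }
  }
  return boxes;
}
```

With the `data` object from the recognize call, usage would be something like `collectWordBoxes(data.blocks)`.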

Once I had bounding boxes, the next challenge was converting them into SVG. For each line of text, I created a <tspan> element with coordinates taken from the WordBox. This made the text editable and ready to be replaced with dynamic values.
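
One detail worth knowing when mapping a box to a <tspan>: SVG positions text by its baseline, not its top edge. The dependency-free sketch below uses the bottom of the box (`y1`) as a rough stand-in for the baseline and escapes the recognized text. The `WordBox` shape, `toTspan`, and `escapeXml` are hypothetical, and whether `y1` looks right depends on the font and on descenders.

```typescript
// Assumed shape of a WordBox: text, bounding box, confidence.
interface WordBox {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
  confidence: number;
}

// Escape characters that would otherwise break the XML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render one box as a <tspan>, using the left edge for x and the
// bottom edge as an approximate baseline for y.
function toTspan(box: WordBox): string {
  const { x0, y1 } = box.bbox;
  return `<tspan x="${x0}" y="${y1}">${escapeXml(box.text)}</tspan>`;
}
```

An XML builder library normally handles the escaping for you; the manual version is only here to keep the sketch self-contained.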

Here is the core builder function:

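import { Builder } from 'xml2js';

// Assumed shape of the custom WordBox described above; the field names
// follow how the function below uses them.
interface WordBox {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
  confidence: number;
}

const buildSvgFromBoxes = (boxes: WordBox[]): string => {
  // xml2js convention: `$` holds attributes, `_` holds text content.
  const svgObj = {
    svg: {
      $: {
        xmlns: 'http://www.w3.org/2000/svg',
        width: '600',
        height: '300'
      },
      text: [
        {
          tspan: [
            { $: { x: '10', y: '20' }, _: '' } // seed element
          ]
        }
      ]
    }
  };

  // One <tspan> per box, positioned at the box's top-left corner.
  // SVG treats y as the text baseline, so y1 may be a closer visual fit.
  for (const box of boxes) {
    svgObj.svg.text[0]?.tspan.push({
      $: { x: box.bbox.x0.toString(), y: box.bbox.y0.toString() },
      _: box.text
    });
  }

  // headless: true omits the <?xml ...?> declaration.
  const builder = new Builder({ headless: true });
  const xml = builder.buildObject(svgObj);
  console.log(xml);
  return xml;
};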

Working with image data is never simple. Part of the challenge is the technology. Part of it is the standards and design choices made decades ago. OCR is imperfect, but it is still the best tool available. For anyone facing the same problem, I feel for you.