Extract text from images and PDFs using OCR in JavaScript

By Aniket Bhattacharyea
August 5, 2024
6 min read

Optical character recognition (OCR) is a technique that converts images of text into machine-encoded text. OCR is commonly used to extract textual data from images and to convert handwritten, typed, or scanned text into editable and searchable text.

When it comes to data extraction, OCR is commonplace. It can convert scanned or photographed documents into digital text, enabling you to extract specific information, such as names, addresses, or numbers, for further processing. OCR can also be integrated with various tools and services. For example, with the help of OCR and a machine learning model, you can extract text from a resume and parse its contents. You can also combine OCR with computer vision to process real estate documents, such as mortgages and loan documents.

In this article, you'll learn how to use PDF.js and Tesseract.js to extract text from a PDF in JavaScript.

What is OCR?

OCR is a process that extracts textual data from images or documents. Early OCR systems read only one character at a time and could work with only one language, one font, and clean, high-resolution images. In contrast, modern OCR systems typically support multiple languages and fonts and can handle a variety of images, including blurry, distorted, noisy, and low-resolution ones, with reasonable accuracy.

OCR is useful in many different contexts. For instance, you can extract items and prices from a receipt or invoice for data entry, or you can convert a scanned book into digital text for archiving. Additionally, you can use OCR to extract data from a user-uploaded document, such as a CV, certificate, or medical document.
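
As a rough illustration of the receipt and invoice use case, the following sketch pulls item descriptions and prices out of text that an OCR engine has already produced. The helper name `extractLineItems`, the sample text, and the assumed line layout (one item per line, ending in a price with two decimals) are all hypothetical; real OCR output is usually noisier and needs more forgiving rules.

```javascript
// Hypothetical helper: pulls "<description> <price>" pairs out of OCR text.
// Assumes one item per line and prices such as 12.50 or $1,234.00.
function extractLineItems(ocrText) {
  const linePattern = /^(.+?)\s+\$?(\d[\d,]*\.\d{2})\s*$/;
  const items = [];
  for (const rawLine of ocrText.split(/\r?\n/)) {
    const match = rawLine.trim().match(linePattern);
    if (!match) continue; // skip headers, blank lines, and OCR noise
    items.push({
      description: match[1].trim(),
      price: Number(match[2].replace(/,/g, '')), // drop thousands separators
    });
  }
  return items;
}

// Made-up sample standing in for text returned by an OCR engine
const sampleText = [
  'INVOICE #1001',
  'Widget A        12.50',
  'Service fee   $1,234.00',
  '',
  'Thank you!',
].join('\n');

console.log(extractLineItems(sampleText));
```

Lines that do not end in a price, such as the invoice number and the closing note, are ignored.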

‍

Prerequisites

To follow along with this tutorial, you need the following:

* The latest version of your favorite web browser, such as Firefox, Chrome, or Safari.

* A static file server to serve the HTML files. This article uses the `serve` package for Node.js, but other servers, such as Nginx or Apache, also work.

* Your favorite code editor, such as Visual Studio Code or WebStorm.

‍

Extract text from a PDF in JavaScript using Tesseract.js

In this scenario, you're part of a company that wants to digitize its old invoices. Your job is to develop an OCR application to extract all text from a given PDF invoice.

To perform the OCR in JavaScript, you'll use the Tesseract.js library. This library is a pure JavaScript port of the famous Tesseract OCR engine using WebAssembly. With support for over one hundred languages, text orientation and script detection, and an interface for reading paragraph, word, and character bounding boxes, Tesseract.js is one of the best OCR libraries for JavaScript.

However, Tesseract.js only supports extracting text from an image. That means you need to convert the PDF pages into images first, which is what you'll use PDF.js for. PDF.js not only converts the pages to images but also displays the PDF on the screen simultaneously for a better user experience:

Architecture diagram

‍

To get started, you need to create a new directory for the project:


mkdir ocr_tutorial && cd ocr_tutorial

In this directory, create an HTML file named `index.html` with the following code:


<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">

	<script src="https://cdn.tailwindcss.com"></script>
	<title>OCR With JS</title>
</head>
<body class="bg-gray-100 h-screen flex items-center justify-center">
	<div class="flex h-full w-full">
		<!-- Left Column - PDF Viewer -->
		<div class="w-1/2 p-4 flex-grow flex flex-col">
			<!-- PDF Viewer Container -->
			<div id="pdfViewer" class="border border-gray-300 flex-grow mb-4 flex items-center justify-center">
				<canvas id="canvas"></canvas>
			</div>

			<div class="flex mb-4">
				<button class="bg-blue-500 text-white px-4 py-2 rounded mr-2" id="prev">Previous</button>
				<button class="bg-blue-500 text-white px-4 py-2 rounded" id="next">Next</button>
				<span class="text-gray-600 ml-2">Page: <span id="page_num"></span> / <span id="page_count"></span></span>
			</div>

			<label for="uploadPDF" class="block text-sm font-medium text-gray-700">Choose a PDF file:</label>
			<input type="file" id="uploadPDF" class="mt-1 p-2 border rounded-md" accept="application/pdf">

		</div>

		<!-- Right Column - Textbox for Extracted Text -->
		<div class="w-1/2 p-4 flex-grow">
			<textarea class="w-full h-full border border-gray-300 p-2" id="extracted-text" placeholder="Extracted Text"></textarea>
		</div>
	</div>
</body>
</html>

This code sets up the basic structure of the web page by dividing it into two columns. The left column contains the PDF viewer (the `canvas` element within the `div` with ID `pdfViewer`), two buttons to navigate to the previous or next page, a page counter, and a file upload input. The right column contains a `textarea` where the extracted text is displayed.

The HTML file includes some Tailwind CSS for creating a cleaner UI. You can open this file in your favorite web browser, and you should be able to see the following structure:

The structure of the web page

‍

Add PDF.js

To include the PDF.js JavaScript and CSS files, add the following snippets to `index.html`:


<!DOCTYPE html>
<html lang="en">
<head>

	...

	<!-- Add this -->
	<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/4.0.269/pdf_viewer.min.css" integrity="sha512-XYRLVU5scloPRU41FDEe7++i3JZRdR0jwy48SVx1fPptEhzQgMp/gagTyNwZXoNRhNH/A3Aj3emakRatx2OjbQ==" crossorigin="anonymous" referrerpolicy="no-referrer" />
</head>
<body class="bg-gray-100 h-screen flex items-center justify-center">

	...

	<!-- Add this -->
	<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/4.0.269/pdf.min.mjs" type="module"></script>
</body>
</html>

Note: These URLs correspond to the latest version of PDF.js as of the time of writing this article. If you change these URLs, the code in this article may not work.

Then create a new directory named `js` inside the project root and create a file named `main.js` inside it. This is where the heart of the code will go.

‍

Build the PDF viewer

In this section, you'll build the PDF viewer before you perform the OCR.

Start by declaring some global variables:


var pdfDoc = null, // to hold the current PDF
	pageNum = 1, // current page
	pageRendering = false, // to tell if a page is currently being rendered
	pageNumPending = null; // The page number which is queued to be rendered next

‍

The last two variables are needed to implement the next/previous functionality. You can't render two pages on the same canvas simultaneously, so you have to wait for the previous render to finish. When the user presses the next or previous button, the new page is rendered immediately if `pageRendering` is `false` (i.e., the previous render is complete). Otherwise, the new page number is stored in `pageNumPending` and rendered once the current render finishes. Because there is only one pending slot, a later request replaces an earlier one that has not started yet.
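
The queueing rule described above can be tried outside the browser. The following is a minimal sketch and not part of the tutorial's code: `createRenderQueue` is a hypothetical helper, and `fakeRender` is a timer that stands in for the real PDF.js render call.

```javascript
// Hypothetical helper: wraps any async render function in a single-slot queue.
function createRenderQueue(renderFn) {
  let rendering = false; // plays the role of pageRendering
  let pending = null;    // plays the role of pageNumPending

  async function run(num) {
    rendering = true;
    try {
      await renderFn(num);
    } finally {
      rendering = false;
    }
    if (pending !== null) {
      const next = pending;
      pending = null;
      return run(next); // render the most recently requested page
    }
  }

  return function request(num) {
    if (rendering) {
      pending = num; // overwrite: only the latest request matters
      return Promise.resolve();
    }
    return run(num);
  };
}

// Stand-in for a slow page render
const renderedPages = [];
const fakeRender = (num) =>
  new Promise((resolve) => setTimeout(() => {
    renderedPages.push(num);
    resolve();
  }, 5));

const requestPage = createRenderQueue(fakeRender);
const firstRequest = requestPage(1);
requestPage(2); // arrives while page 1 is rendering
requestPage(3); // replaces the pending request for page 2
firstRequest.then(() => console.log(renderedPages));
```

Page 2 is never rendered here because the request for page 3 replaced it before the first render finished.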

Next, create an `async` function named `showPDF`, which loads and renders the PDF:


async function showPDF(pdfData) {
  var { pdfjsLib } = globalThis;
  pdfjsLib.GlobalWorkerOptions.workerSrc = 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/4.0.269/pdf.worker.min.mjs';


  var loadingTask = pdfjsLib.getDocument(pdfData);
  loadingTask.promise.then(function(pdf) {
	console.log('PDF loaded');
	pdfDoc = pdf;
	document.getElementById('page_count').textContent = pdfDoc.numPages;
	renderPage(pageNum);
    
  }, function (reason) {
	// PDF loading error
	console.error(reason);
  });
}

‍

This function loads a worker from the PDF.js content delivery network (CDN) and uses `getDocument` to load a PDF. The argument `pdfData` that is passed to this function is a Base64-encoded data URL.

After the PDF is loaded, the total page count is set on the `page_count` span, and the `renderPage` function is called with `pageNum` as an argument. The initial value of `pageNum` is `1`, which means the first page is rendered first.
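
To make the data URL passed to `showPDF` more concrete, here is a small sketch that decodes one into raw bytes and checks for the `%PDF-` signature that PDF files normally start with. The helpers `dataUrlToBytes` and `looksLikePdf` are hypothetical and not part of PDF.js, and the example URL is built by hand rather than read from a real file.

```javascript
// Hypothetical helper: decodes a Base64 data URL into its MIME type and raw bytes.
function dataUrlToBytes(dataUrl) {
  const match = /^data:([^;,]*)(;base64)?,(.*)$/s.exec(dataUrl);
  if (!match || !match[2]) {
    throw new Error('Expected a Base64-encoded data URL');
  }
  const binary = atob(match[3]); // atob is global in browsers and in Node.js 16+
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: match[1], bytes };
}

// Hypothetical helper: checks for the ASCII signature "%PDF-" at the start.
function looksLikePdf(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d];
  return signature.every((byte, i) => bytes[i] === byte);
}

// Hand-built example; a real URL would come from readAsDataURL
const exampleUrl = 'data:application/pdf;base64,' + btoa('%PDF-1.7\n');
const { mimeType, bytes } = dataUrlToBytes(exampleUrl);
console.log(mimeType, looksLikePdf(bytes));
```

A check like this could run before calling `showPDF` to reject files that are not PDFs despite their extension.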

Create another function named `renderPage`:


function renderPage(num) {
	pageRendering = true; // block further renders until this one finishes
	pdfDoc.getPage(num).then(function(page) {
		var scale = 1.5;
		var viewport = page.getViewport({scale: scale});

		// Size the canvas to match the page at the chosen scale
		var canvas = document.getElementById('canvas');
		var context = canvas.getContext('2d');
		canvas.height = viewport.height;
		canvas.width = viewport.width;

		var renderContext = {
			canvasContext: context,
			viewport: viewport
		};
		var renderTask = page.render(renderContext);
		renderTask.promise.then(function () {
			pageRendering = false;
			if (pageNumPending !== null) {
				// A new page was requested during rendering; render it now
				renderPage(pageNumPending);
				pageNumPending = null;
			}
		});
	});
	document.getElementById('page_num').textContent = num;
}

This function fetches the page as referenced by the `num` variable using the `getPage` function. Then it prepares the `canvas` by setting an appropriate height and width by fetching the viewport from the PDF page. Using the `render` function, the page is then rendered.

Once the rendering is complete, the function sets `pageRendering` to `false`. Then, if `pageNumPending` is not `null`, the pending page is rendered. Finally, the `page_num` span is updated to show the current page number.
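
The `scale` value controls how many pixels the OCR engine gets to work with. The sketch below relates a scale to an approximate resolution. It assumes an unrotated page and the default PDF unit of 1/72 inch, so that a scale of 1 corresponds to roughly 72 DPI. `scaleForDpi` and `canvasSizeFor` are hypothetical helpers, not PDF.js functions.

```javascript
const PDF_POINTS_PER_INCH = 72; // default PDF user unit is 1/72 inch

// Hypothetical helper: scale that yields roughly the requested resolution.
function scaleForDpi(targetDpi) {
  return targetDpi / PDF_POINTS_PER_INCH;
}

// Hypothetical helper: canvas size in pixels for a page size given in PDF points.
function canvasSizeFor(pageSize, scale) {
  return {
    width: Math.round(pageSize.width * scale),
    height: Math.round(pageSize.height * scale),
  };
}

const letterPage = { width: 612, height: 792 }; // US Letter in points
console.log(scaleForDpi(108)); // the tutorial's scale of 1.5
console.log(canvasSizeFor(letterPage, scaleForDpi(300)));
```

Tesseract's documentation suggests that recognition generally improves with higher-resolution input, often citing about 300 DPI, so raising `scale` may help on small print at the cost of slower rendering and recognition.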

In the same file, create a new function called `queueRenderPage` that renders a new page if the previous render has finished. Otherwise, it queues the new page:


function queueRenderPage(num) {
	if (pageRendering) {
    		pageNumPending = num;
	} else {
    		renderPage(num);
	}
}

‍

Then create the event listeners for the next and previous buttons:


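function onPrevPage() {
	if (!pdfDoc || pageNum <= 1) return; // already on the first page
	pageNum--;
	queueRenderPage(pageNum);
}
document.getElementById('prev').addEventListener('click', onPrevPage);

function onNextPage() {
	if (!pdfDoc || pageNum >= pdfDoc.numPages) return; // already on the last page
	pageNum++;
	queueRenderPage(pageNum);
}
document.getElementById('next').addEventListener('click', onNextPage);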
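
The bounds checks for page navigation can also be written as a single clamp, which is easy to test without a browser. This is only a sketch of the idea: `clampPage` is a hypothetical helper, and the page count of 3 is an assumed value for the example.

```javascript
// Hypothetical helper: keeps a requested page number inside 1..totalPages.
function clampPage(requested, totalPages) {
  return Math.min(Math.max(requested, 1), totalPages);
}

const exampleTotalPages = 3; // assumed page count for the example
console.log(clampPage(0, exampleTotalPages)); // below the first page
console.log(clampPage(2, exampleTotalPages)); // within range
console.log(clampPage(4, exampleTotalPages)); // beyond the last page
```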

‍

Create a new function named `readFileAsDataURL` that takes the uploaded PDF file and converts it into a data URL:


function readFileAsDataURL(file) {
	return new Promise((resolve, reject) => {
    	let fileReader = new FileReader();
    	fileReader.onload = () => resolve(fileReader.result);
    	fileReader.onerror = () => reject(fileReader);
    	fileReader.readAsDataURL(file);
	});
}

‍

Add the event listener to the file upload input that calls `readFileAsDataURL` and then calls `showPDF` to render the PDF:


var uploadPDF = document.getElementById('uploadPDF');

uploadPDF.addEventListener('change', function(e) {
	let file = e.currentTarget.files[0];
	if (!file) return;
	readFileAsDataURL(file).then((b64str) => {
		// Reset the viewer state before showing the new document
		pageNum = 1;
		pageRendering = false;
		pageNumPending = null;
		showPDF(b64str);
	});
}, false);

In `index.html`, add the `main.js` file as a script:


<!DOCTYPE html>
<html lang="en">
<head>
	...
</head>
<body class="bg-gray-100 h-screen flex items-center justify-center">
	...
	<!-- Add this -->
	<script src="./js/main.js" type="module"></script>
</body>
</html>

Because `main.js` is loaded as a `module`, the browser blocks it with a cross-origin resource sharing (CORS) error if you open `index.html` directly from the file system. Instead, use a static file server to serve the HTML and JavaScript files. This tutorial uses the `serve` package for Node.js. Install it with `npm install -g serve` and run the following command from the root of the project directory:


serve .

Visit `localhost:3000` in your web browser, and you should be able to open the web page.

Then you can try out the PDF viewer by uploading a PDF file. If you want, you can use this sample PDF file. You should see the first page being rendered:

The first page of the PDF

‍

When you press the Next button, the next page should render:

The second page of the PDF

‍

Perform OCR with Tesseract.js

To perform OCR with Tesseract.js, start by adding the Tesseract.js JavaScript file from the CDN:


<!DOCTYPE html>
<html lang="en">
<head>
	...
</head>
<body class="bg-gray-100 h-screen flex items-center justify-center">
	...

	<!-- Add this -->
	<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>
	<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/4.0.269/pdf.min.mjs" type="module"></script>
	<script src="./js/main.js" type="module"></script>
</body>
</html>

Note: These URLs correspond to the latest version of Tesseract.js as of the time of writing this article. If you change these URLs, the code in this article may not work.

Then in `main.js`, add a global variable to hold the `TesseractWorker`:


var tesseractWorker = null;

Create a new function called `initTesseract` that initializes the worker:


async function initTesseract() {
	tesseractWorker = await Tesseract.createWorker('eng', 1, {workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v5.0.0/dist/worker.min.js'});
}

‍

The worker is loaded from the Tesseract.js CDN, and English is chosen as the language.

Create a function called `loadImage` that loads an image from a URL into an `Image` object. Here, the URL will be the data URL of the page rendered on the canvas:


async function loadImage(url) {
	return new Promise((resolve, reject) => {
		const img = new Image();
		img.addEventListener('load', () => resolve(img));
		img.addEventListener('error', (err) => reject(err));
		img.src = url;
	});
}

Then create a function `extractText` that performs the OCR on the image by calling the `recognize` function from Tesseract.js:


async function extractText() {
	await initTesseract();
	let imageString = document.getElementById('canvas').toDataURL();
	let image = await loadImage(imageString);
	const { data: { text } } = await tesseractWorker.recognize(image);
	// Set `value` so the textarea updates even after the user has edited it
	document.getElementById('extracted-text').value = text;
	await tesseractWorker.terminate();
}
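
`extractText` creates and terminates a worker on every call, which keeps the code simple but repeats the startup work for each page. One possible alternative, sketched below under assumptions, is to create the worker once and reuse it. `createLazyWorker` is a hypothetical helper, and `fakeFactory` stands in for the real worker so that the sketch runs without Tesseract.js.

```javascript
// Hypothetical helper: creates the worker on first use and reuses it afterwards.
function createLazyWorker(factory) {
  let workerPromise = null;
  return {
    get() {
      if (workerPromise === null) {
        workerPromise = factory(); // store the promise so concurrent callers share it
      }
      return workerPromise;
    },
    async dispose() {
      if (workerPromise === null) return;
      const worker = await workerPromise;
      workerPromise = null;
      await worker.terminate();
    },
  };
}

// Stand-in factory; in the page you might pass
// () => Tesseract.createWorker('eng') instead.
let workersCreated = 0;
const fakeFactory = async () => {
  workersCreated += 1;
  return { terminate: async () => {} };
};

const lazyWorker = createLazyWorker(fakeFactory);
const bothWorkers = Promise.all([lazyWorker.get(), lazyWorker.get()]);
bothWorkers.then(() => console.log('workers created:', workersCreated));
```

With this pattern, `extractText` would call `get()` instead of `initTesseract()`, and `dispose()` would run only when the page no longer needs OCR.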

‍

Finally, call `extractText` in the `renderPage` function right after the page rendering is complete:


function renderPage(num) {
	pdfDoc.getPage(num).then(function(page) {

		...

		renderTask.promise.then(function () {
			extractText(); // Add this
			...
		});
	});
	...
}
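
OCR usually takes longer than rendering, so if the user changes pages quickly, two `extractText` calls may overlap and the older result could arrive last. The sketch below shows one way to guard against that by letting only the most recent request update the UI. `createLatestOnly` is a hypothetical helper, and `fakeOcr` is a timer that stands in for the recognition call.

```javascript
// Hypothetical helper: runs async tasks but only reports the newest one's result.
function createLatestOnly() {
  let latestTicket = 0;
  return async function run(task, onResult) {
    const ticket = ++latestTicket;
    const result = await task();
    if (ticket === latestTicket) {
      onResult(result); // only the most recent request may update the UI
    }
  };
}

// Stand-in for OCR: resolves with the given text after a delay
const fakeOcr = (ms, text) =>
  new Promise((resolve) => setTimeout(() => resolve(text), ms));

const shownTexts = [];
const runLatest = createLatestOnly();

// Page 1 is slow and page 2 is fast, so page 1 finishes last.
const allDone = Promise.all([
  runLatest(() => fakeOcr(30, 'text of page 1'), (text) => shownTexts.push(text)),
  runLatest(() => fakeOcr(5, 'text of page 2'), (text) => shownTexts.push(text)),
]);
allDone.then(() => console.log(shownTexts));
```

In the page, the task would wrap the OCR call and the callback would write the text to the `extracted-text` textarea.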

‍

Now, it's time to test it. Reload the web page and upload the PDF file. You should see the first page on the left side and the extracted text on the right:

The extracted text from the first page

‍

Press Next to move to the next page. The extracted text should update with the new text from the second page:

The extracted text from the second page

‍

Congratulations! You have successfully performed OCR with Tesseract.js.

‍

Conclusion

In this article, you learned how to perform OCR in JavaScript using PDF.js and Tesseract.js. PDF.js renders each page of the PDF to a canvas, the canvas is converted to an image, and Tesseract.js extracts the text from that image. This setup works well for extracting text from simple PDFs.

You can find the complete code for this tutorial on GitHub.

When working with PDFs, you also need to figure out how to handle eSignatures. Implementing digital signatures that are both legally binding and secure has always been a challenge. With Dropbox Sign, you can integrate eSignature functionality into your applications. It offers document templates, automatic reminders, mobile-friendly signing, and affordable pricing for your document signing needs.
