Tutorial: Document Text Selection

Document Text Selection

Web Document Viewer supports an optional text layer on top of the displayed documents. End users can select and copy document text, or search through it, using the default WDV UI. Alternatively, corresponding APIs are available to the application.

Enabling Text Layer Support

Text selection support is disabled by default. To enable it, set the allowtext config option to true when constructing the WDV.
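As a minimal sketch of the option described above, the configuration below turns the text layer on. Only allowtext comes from this tutorial; every other option is left out, and the guard around the constructor is an addition that lets the snippet load even where the WDV script is not present.

```javascript
// Sketch: viewer configuration with the text layer enabled.
// Only `allowtext` is taken from this tutorial; the remaining
// options (parent element, server URL, etc.) are omitted.
var viewerConfig = {
    //.... other configuration omitted
    allowtext: true
};

// Construct the viewer only when the WDV script has been loaded.
var viewer = null;
if (typeof Atalasoft !== 'undefined') {
    viewer = new Atalasoft.Controls.WebDocumentViewer(viewerConfig);
}
```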

After that, the JavaScript control starts requesting text data from the server, and an additional toolbar button is added that switches the input mode to text selection or back to the regular "pan" mode. The end user can use this button to switch the input mode.

Text selection input mode is separate from all other viewer operations: while it is active, it is not possible to interact with annotations or PDF forms until the input mode is switched back.

[Figure: text columns]

Text selection has several related configuration options that modify the default behavior. For details, see the config.mousetool.text documentation.

For example, to enable copying the selected text with Ctrl+C, use config.mousetool.hookcopy.

Document Text Search

Additionally, the document search UI can be turned on using the config.mousetool.text.allowsearch config option.

If enabled, the default text search UI controls are added to the document toolbar.
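A minimal configuration sketch for the option above follows. The nesting mirrors the config.mousetool.text.allowsearch path named in this section; setting allowtext alongside it is an assumption, on the basis that search needs the text layer to be enabled.

```javascript
// Sketch: enable the default search UI.
// The nesting follows the `config.mousetool.text.allowsearch` path
// named above; `allowtext: true` is assumed to be required as well.
var searchConfig = {
    //.... other configuration omitted
    allowtext: true,
    mousetool: {
        text: {
            allowsearch: true
        }
    }
};
// Pass `searchConfig` to the WebDocumentViewer constructor.
```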

Note that the default search is implemented using the public API and can easily be reimplemented by the application if the UI appearance or business logic needs to change.

Document Text Search API

Text search is an asynchronous operation in WDV. Each searchOnPages call starts a new operation and abandons all previous search operations.

Executing a search through the API also highlights the search results in the UI and selects the current search match.

Search results are passed to the application through a TextSearchIterator, which is passed as a parameter to the callback function that the application provides when calling the corresponding APIs. The iterator can be used to move the current match back and forth and thus continue the search.
For example:

var _viewer = new Atalasoft.Controls.WebDocumentViewer({
    //.... configuration omitted
});

// cache the current iterator.
var _currentIterator;
// search starts at the beginning of the document, i.e. at page 0,
// and ends at the page with index 7.
// the viewer starts scrolling to search results from page 3 onward,
// i.e. results on pages 0-2 will be highlighted but skipped.
_viewer.text.searchOnPages('find me', 0, 7, 3, onNextMatch);

// this demonstrates "find next action"
function onFindNext() {
    if (_currentIterator && _currentIterator.isValid()) {
        _currentIterator.next(onNextMatch);
    }
}

// this demonstrates "find previous action"
function onFindPrevious() {
    if (_currentIterator && _currentIterator.isValid()) {
        _currentIterator.prev(onNextMatch);
    }
}

// callback signature is the same for search, next and prev functions.
function onNextMatch(iterator, match) {
    // check that our search is not abandoned yet.
    if (iterator.isValid()) {
        //  cache active iterator.
        _currentIterator = iterator;
        // select first found result with word precision. If match.word parameter is omitted, whole line will be selected.
        _viewer.text.selectPageText(match.page, match.region, match.line, match.word);
    }
}
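For an application that replaces the default search UI, the calls from the example above can be gathered into one object that custom buttons and inputs call into. This is only a sketch: createSearchController and its method names are hypothetical, the viewer is passed in rather than created, and the check for a missing match is a defensive assumption, since this tutorial does not say what the callback receives when nothing is found.

```javascript
// Sketch: wrap the search calls from the example above in one object.
// `createSearchController`, `find`, `next` and `previous` are hypothetical
// names; `viewer` is a WebDocumentViewer with the text layer enabled and
// `lastPageIndex` is supplied by the application.
function createSearchController(viewer, lastPageIndex) {
    var currentIterator = null;

    // Same callback signature as in the example above.
    function onMatch(iterator, match) {
        // Ignore callbacks from abandoned searches. The `!match` check is
        // a defensive assumption for the "nothing found" case.
        if (!iterator.isValid() || !match) {
            return;
        }
        currentIterator = iterator;
        viewer.text.selectPageText(match.page, match.region, match.line, match.word);
    }

    return {
        // Start a new search over pages 0..lastPageIndex, scrolling from page 0.
        // Any earlier search is abandoned by the viewer.
        find: function (query) {
            currentIterator = null;
            viewer.text.searchOnPages(query, 0, lastPageIndex, 0, onMatch);
        },
        // Move to the following match, if a search is active.
        next: function () {
            if (currentIterator && currentIterator.isValid()) {
                currentIterator.next(onMatch);
            }
        },
        // Move to the preceding match, if a search is active.
        previous: function () {
            if (currentIterator && currentIterator.isValid()) {
                currentIterator.prev(onMatch);
            }
        }
    };
}
```

A custom search box would then call find with the entered text, and its "next" and "previous" buttons would call next and previous. Keeping the iterator inside the closure avoids the shared global variable used in the example above.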

See search API for details.

Text Selection Document Types

From a text selection point of view, there are three classes of documents.

PDF documents

PDF documents fall into two classes.

  • Documents that do not have a text layer. They should be treated as regular images.

  • Documents that do have a text layer. Atalasoft DotImage provides out of the box extraction support and an extraction API. This behavior is available if the PDF text extraction (TextExtract) licensing flag is enabled.

    Note that for large PDF files, or files with a complicated structure or a lot of symbols, extraction can take some time. This may cause a delay on the client side between the page image being shown to the end user and the page text being loaded; for example, a "loading" cursor is shown when the user attempts to select text on a page whose text has not loaded yet. For such documents, developers should consider pre-generating extraction data, since this improves overall performance.

MS Office documents

MS Office documents can be rendered using two DotImage decoders: OfficeAdapterDecoder and OfficeDecoder.

  • OfficeDecoder: Supports both rendering and text extraction. For this decoder, Atalasoft DotImage provides out-of-the-box text selection support.
  • OfficeAdapterDecoder: Uses MS Office Automation for document rendering. The output images of this decoder should be treated as regular images.

Regular Images

Text data should be provided by the application, for example extracted by an OCR engine. Since OCR is a resource-intensive operation that is not intended to run on a web server, there is little point in providing out-of-the-box extraction support.

OCR engine output could be translated to the Atalasoft DotImage object model using the Atalasoft.Imaging.WebControls.OCR.JsonTranslator class. Keep in mind that if an OCR engine performs internal image transformations, extracted text coordinates could differ from actual text position on the source image.
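To illustrate the coordinate caveat, here is a small sketch of the simplest case, where the OCR engine worked on a resized copy of the page. It assumes that scaling was the only transformation (rotation, deskew or cropping would need more than this), and the { x, y, width, height } shape and the toSourceBounds name are hypothetical, not part of the DotImage object model. The arithmetic is the same in any language; JavaScript is used only to match the rest of this tutorial.

```javascript
// Sketch: map a word rectangle reported on a resized OCR image back to
// source-image pixels. Assumes per-axis scaling was the only transformation.
// The { x, y, width, height } shape is hypothetical.
function toSourceBounds(ocrBounds, ocrSize, sourceSize) {
    var scaleX = sourceSize.width / ocrSize.width;
    var scaleY = sourceSize.height / ocrSize.height;
    return {
        x: ocrBounds.x * scaleX,
        y: ocrBounds.y * scaleY,
        width: ocrBounds.width * scaleX,
        height: ocrBounds.height * scaleY
    };
}

// Example: the OCR engine worked on a 2x upscaled copy of an 850x1100 page.
var sourceBounds = toSourceBounds(
    { x: 200, y: 400, width: 120, height: 40 },
    { width: 1700, height: 2200 },
    { width: 850, height: 1100 }
);
// sourceBounds is { x: 100, y: 200, width: 60, height: 20 }
```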

See the Programming Document Viewer Server tutorial for details on how to provide document text data using the WebDocumentRequestHandler.PageTextRequested event.