[...] image and noise. The linear filter's transfer function is given by [illegible] the target image against which the template is being compared. It can be shown that the match occurs at those points where the template matches best with the image. Clearly the parameter over which the filter output is optimized is the displacement. Thus the matched filter can be implemented as an area correlator with a preprocessing filter. For the case of Gaussian noise, this filter is identity.
Now as for the method of the match itself, there are two different approaches to it. Clearly the easiest one is the direct search approach [6]. However, this is feasible only for relatively small images due to the computational complexity for an [illegible] search area as described before. Two-dimensional logarithmic search methods (which have a much lower logarithmic complexity) can be used to significantly reduce the computational overhead [6]. However, for the class of images that we are interested in, this is inapplicable as the function that is minimized is not convex even locally for the domain representing the capture region. Thus the gradient of the match error at any location does not necessarily indicate the direction in which the search should proceed. These methods are more appropriate for images with gray values. A hierarchical coarse-to-fine strategy is employed when the observed image is very large for a [illegible]. First a coarse version of the image is compared against a similar coarse version of the template. For all areas of potential match, matching is done in the next higher resolution and so on [5]. The main idea behind the use of this so-called scale space ... features would persist through the coarse to fine scales even if their location might be distorted somewhat. Thus they can be tracked back down the scale path to be located exactly in the zero scale. The most important part of the scale space theory is the so-called causality or monotonicity property, which states that any feature once present at some scale must also be present at lower scales [4, 1]. There are a couple of problems in directly applying these methods to images consisting of engineering drawings. One is that an image of a higher scale is obtained by the Gaussian smoothing of the ...
Since we need to have the image processed and stored at several scales, for a large image this involves a large memory requirement and also significant computation for the Gaussian smoothing. Further, for the case of binary images that might be [illegible], this might actually increase the computational overhead because all the intermediate images are no more binary. Also, this [...]
In the next step, we go down [illegible] to identify the potential matches and then finally test them in the finest scale to confirm the matches. For the case of rotations, we need to rotate the template and perform the second step.
As mentioned repeatedly before, the main purpose of this step is to quickly rule out areas that are unlikely to be candidate matches.
The input image is a collection of binary bits compressed using the TIFF format. Since this is often too large [illegible] such a large image, we scale it down to [illegible] of the original size. One easy way to scale it down would be just to consider 1 out of every 4 pixels in both the x and y directions. However, the resultant image becomes [illegible] and broken. To avoid this, we first promote the image to grey level from binary, and then re-threshold it adaptively to binary. This significantly improves the quality of the scaled image. The threshold value is computed adaptively for a local window. The bottom line is that at the conclusion of this step, we once again end up with a binary image, albeit a much smaller one. We shall call this image I.
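The scaling step described above (promote to grey level, then re-threshold adaptively over a local window) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `downscale_binary`, the block-averaging used for the grey promotion, the tile-mean threshold rule, and the `window` and `min_ink` parameters are all hypothetical choices, not details given in the text.

```python
import numpy as np

def downscale_binary(img, factor=4, window=8, min_ink=1.0 / 16):
    """Shrink a binary image (1 = black) by `factor` in x and y.

    Assumed scheme: each factor x factor cell is averaged to a grey value,
    then every window x window tile of grey values is re-thresholded against
    its own mean (never below `min_ink`, so empty tiles stay empty).
    """
    img = np.asarray(img, dtype=np.float64)
    h = (img.shape[0] // factor) * factor
    w = (img.shape[1] // factor) * factor
    # Grey promotion: fraction of black pixels inside each cell.
    grey = img[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    out = np.zeros(grey.shape, dtype=np.uint8)
    # Adaptive re-threshold, one local tile at a time.
    for r in range(0, grey.shape[0], window):
        for c in range(0, grey.shape[1], window):
            tile = grey[r:r + window, c:c + window]
            t = max(tile.mean(), min_ink)
            out[r:r + window, c:c + window] = tile >= t
    return out

# A one-pixel-thick line that plain 1-in-4 subsampling would lose entirely.
page = np.zeros((32, 32), dtype=np.uint8)
page[1, :] = 1
naive = page[::4, ::4]
scaled = downscale_binary(page)
```

In this small example the naive subsample contains no black pixels, while the grey-promoted version keeps the line as one unbroken row, which is the kind of breakage the passage says the grey-level step is meant to avoid.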
The next step is to subdivide the output image that we get from the previous step into 64x64 blocks. Actually we create an image [illegible] where each pixel corresponds to a 64x64 block of I. The value of each pixel is equal to the number of black pixels in the 64x64 block. Once this is created, for each pixel in it we see if the pixel value is larger than a predetermined threshold. If so, then we keep the corresponding block in I for further consideration, else it is not considered further anymore. The choice of the threshold can be dependent on the template, but can be made independent of it as well. If we know the template in advance, and if we find it busy, i.e., there are a relatively large amount of black pixels, we would use a higher threshold than when we don't know the template a priori. Otherwise we use a value of the threshold to be 128, which represents two lines across the block. Thus the assumption is that the template is likely to be more than just two lines. Clearly, it is entirely possible that we might miss small parts of a pattern, but as we will see later, as long as some parts of the pattern as it appears in the image are still under consideration, we can still retrieve the object location.
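The block screening described above can be illustrated with a short sketch. The 64x64 block size, the strict "larger than" comparison, and the default threshold of 128 come from the passage; the function name `screen_blocks`, the convention that 1 means a black pixel, and the decision to trim partial edge blocks are assumptions made for the example.

```python
import numpy as np

def screen_blocks(I, block=64, threshold=128):
    """Count black pixels (value 1) per block and flag blocks worth keeping.

    Returns (counts, keep): `counts` holds one value per block of I, and
    `keep` is True where the count is strictly larger than `threshold`.
    Partial blocks at the right and bottom edges are ignored (an assumption).
    """
    I = np.asarray(I)
    rows, cols = I.shape[0] // block, I.shape[1] // block
    trimmed = I[:rows * block, :cols * block]
    counts = trimmed.reshape(rows, block, cols, block).sum(axis=(1, 3))
    keep = counts > threshold
    return counts, keep

I = np.zeros((128, 128), dtype=np.uint8)
I[10, 0:64] = 1          # two full lines across the top-left block: count 128
I[20, 0:64] = 1
I[70:73, 64:128] = 1     # three full lines across the bottom-right block: count 192
counts, keep = screen_blocks(I)
```

With the default threshold, the block holding exactly two lines is dropped and the block holding three lines is kept, matching the stated assumption that a template is likely to be more than just two lines.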
Following the above, the rest of the image I that is still under consideration [illegible]
METHOD AND APPARATUS FOR EXTRACTING ANCHORABLE INFORMATION UNITS FROM COMPLEX PDF DOCUMENTS
This application claims the benefit of U.S. Provisional Application No. 60/256,293, filed Dec. 18, 2000.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is concerned with processing multimedia data files to provide information supporting user navigation of multimedia data file content.
2. Background of the Invention
The demand for hypermedia applications has increased with the growing popularity of the World Wide Web. As a result, a need for an effective and automatic method of creating hypermedia has arisen. However, the creation of hypermedia can be a laborious, manually intensive job. In particular, hypermedia creation can be difficult when referencing content in documents including images and/or other media.
In many cases, the hypermedia authors need to locate Anchorable Information Units (AIUs) or hotspots that are areas or keywords of particular significance, and make appropriate hyperlinks to relevant information. In an electronic document, a user can retrieve associated information by selecting these hotspots as the system interprets the associated hyperlinks and fetches the corresponding relevant information.
Previous research in this field has taken scanned bitmap images as the input to a document analysis system. The classification of the document system is often guided by a priori knowledge of the document's class. There has been little work done in using postscript files as a starting point for document analysis. Certainly, if a postscript file is designed for maximum raster efficiency, it can be a daunting task even to reconstruct the reading order for the document. Previous researchers may have assumed that a well-structured source text will always be available to match postscript output and therefore working bottom-up from postscript would seldom be needed. However, PDF documents can be generated in a variety of ways including an Optical Character Recognition (OCR) based route directly from a bitmapped page. The extra structure in PDF, over and above that in postscript, can be utilized towards the goal of document understanding.
Previous work proposed methods related to the understanding of raster images. Being an inverse problem by definition, this task cannot be accomplished without making broad assumptions. Directly applying these methods on PDF documents would make little sense as they are not designed to make use of the underlying structure of PDF files, and thus will produce undesirable results.
In contrast to the geometric layout analysis, logical layout analysis has received very little attention. Some methods of logical layout analysis perform region identification or classification in a derived geometric layout. However, these approaches are primarily rule based and thus, the final outcome depends on the dependability of the prior information and how well the prior information is represented within the rules.
Systems such as Acrobat do not have the ability to process images. Rather, Acrobat runs the whole document through an OCR system. Clearly, OCR is not able to extract objects, but even in the case of understanding text the output can be unreliable, as a general-purpose OCR can be error prone when used to understand scanned-in images directly.
Therefore, a need exists for a method of analyzing and extracting text from PDF documents created using various means.
SUMMARY OF THE INVENTION
According to an embodiment of the present invention, a system is provided for processing a multimedia data file to provide information supporting user navigation of multimedia data file content. The system includes a content parser to identify text and image content of a data file, and an image processor for processing said identified image content to identify embedded text content. The system further includes a text sorter for parsing said identified text and said identified embedded text to locate text items in accordance with predetermined sorting rules, and memory for storing a navigation file containing said text items.
The navigation file links to at least one internal document object. The navigation file links to at least one external document object.
The image processor includes a black and white image processor including a pixel smearing component reducing text to a rectangular block of pixels, and an image filtering component for cleaning a smeared image.
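As a rough illustration of what a pixel smearing component followed by a filtering component might do, the sketch below applies a horizontal run-length smear to one row of a black and white image and then drops short leftover runs. The text does not specify the algorithm, so the run-length approach, the function names `smear_row` and `drop_short_runs`, and the `max_gap` and `min_run` parameters are all assumptions made for the example.

```python
def smear_row(row, max_gap):
    """Fill white gaps of at most `max_gap` pixels between black pixels (1 = black)."""
    out = list(row)
    last_black = None
    for i, v in enumerate(row):
        if v:
            if last_black is not None and 0 < i - last_black - 1 <= max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def drop_short_runs(row, min_run):
    """Erase black runs shorter than `min_run`, a simple stand-in for cleaning."""
    out = list(row)
    i, n = 0, len(row)
    while i < n:
        if row[i]:
            j = i
            while j < n and row[j]:
                j += 1
            if j - i < min_run:
                for k in range(i, j):
                    out[k] = 0
            i = j
        else:
            i += 1
    return out

row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
smeared = smear_row(row, max_gap=2)
cleaned = drop_short_runs(smeared, min_run=4)
```

Applied row by row (and, if desired, column by column), this kind of smearing merges closely spaced character strokes into solid blocks, and the filter removes blocks too small to be text.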
The content parser applies text extraction rules to identify text and identify a document structure, wherein the document structure defines a context for identified text. The content parser applies pre-defined hierarchical rules for determining a level of identified text.
The image processor applies object templates to identify embedded text.
The system refines a search resolution during a text identifying process to determine a location of the embedded text within an image.
Identified text comprises hyperlinks.
According to another embodiment of the present invention, a graphical user interface system is provided supporting processing of a multimedia data file to provide information supporting user navigation of multimedia data file content. The graphical user interface system includes a menu generator for generating one or more menus permitting user selection of an input file and format to be processed, and an icon permitting user initiation of generation of a navigation file supporting linking of input file elements to external documents by parsing and sorting text and image content to identify text for incorporation in a navigation file.
Identified text comprises hyperlinks.
The navigation file further comprises links to at least one internal document object.
According to an embodiment of the present invention, a method is provided for creating an anchorable information unit in a portable document format document. The method includes extracting a text segment from the portable document format document, determining a context of the segment, wherein the context is selected from a context sensitive hierarchical structure, and defining the text segment as an anchorable information unit according to the context.
The portable document format document includes one or more textual objects and one or more non-textual objects, wherein the objects include textual segments.
Determining the context includes comparing the text segment to a plurality of known patterns within the portable document format document, and determining the context upon determining a match between the text segment and a known pattern of the portable document format document.
Extracting text further includes extracting text from an image of the portable document format document, determining an image type, wherein the type is one of a black and white image, a grayscale image, and a color image, and processing the image according to the type.
The portable document format document includes a known context sensitive hierarchical structure. The context sensitive hierarchical structure, including the anchorable information unit, is searchable. The context includes a location of the extracted text segments. Determining the context includes determining a location and a style of the text segment.
The method further includes storing the text segment in a Standard Generalized Markup Language syntax using a predefined grammar.
The anchorable information unit is automatically hyperlinked.
According to an embodiment of the present invention, a method is provided for creating an anchorable information unit file from a portable document format document. The method includes parsing the portable document format document into textual portions and non-text portions. The method further includes extracting structure from the textual portions and the non-text portions, and determining text within the textual portions and text within the non-text portions. The method hyperlinks a plurality of keywords within the textual portions and non-text portions to at least one related document.
Parsing further comprises the step of differentiating color image content, black-and-white content, and grayscale content.
Extracting further comprises determining a level for extracted textual portions, associating the context with the text, and pattern matching extracted text to the portable document format document to determine a context. The level is one of a paragraph, a heading and a subheading. Pattern matching includes determining a median font size for the portable document format document, comparing a font size of the extracted text to the median font size for the portable document format document, and determining a context according to font size.
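The font-size comparison described above might look something like the following sketch. The three levels (paragraph, heading, subheading) and the use of a document-wide median font size come from the text; the function name `classify_levels`, the `(text, font_size)` input format, and the two ratio cut-offs are hypothetical values chosen only for illustration.

```python
from statistics import median

def classify_levels(segments, heading_ratio=1.5, subheading_ratio=1.15):
    """Label each (text, font_size) segment by comparing it to the median size.

    The ratios are illustrative assumptions; the text only says that context
    is determined according to font size relative to the document median.
    """
    med = median(size for _, size in segments)
    labelled = []
    for text, size in segments:
        if size >= heading_ratio * med:
            level = "heading"
        elif size >= subheading_ratio * med:
            level = "subheading"
        else:
            level = "paragraph"
        labelled.append((text, level))
    return labelled

segments = [
    ("Installation Guide", 18),
    ("Mounting the pump", 13),
    ("Attach the bracket to the housing.", 10),
    ("Tighten all four bolts.", 10),
    ("Check the seal for leaks.", 10),
]
levels = classify_levels(segments)
```

Using the median rather than the mean keeps a handful of large headings from shifting the baseline, since body text normally makes up most of the segments in a document.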
Hyperlinking includes creating the anchorable information unit file, wherein the plurality of keywords are anchorable information units.
According to an embodiment of the present invention, a program storage device is provided, readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for creating an anchorable information unit file from a portable document format document.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred embodiments of the present invention will be described below in more detail, with reference to the accompanying drawings:
FIG. 1 is a flow chart showing an overview of a method of creating an anchorable information unit according to an embodiment of the present invention;
FIG. 2 is a flow chart showing a method of creating an anchorable information unit according to an embodiment of the present invention;
FIGS. 3a-b are a flow chart showing a method of creating an anchorable information unit according to an embodiment of the present invention; and
FIG. 4 shows a graphical user interface display supporting processing of a multimedia data file to provide information for use in navigating multimedia data file content, according to an embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The present invention provides an automated method for locating hotspots in a PDF file, and for creating cross-referenced AIUs in hypermedia documents. For example, text strings can point to a relevant machine part in a document describing an industrial instrument.
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In one embodiment, the present invention may be implemented in software as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
The PDF files under consideration can include simple text, or more generally, can include a mixture of text and a variety of different types of images such as black and white, grayscale and color. According to an embodiment of the present invention, the method locates the text and non-text areas, and applies different processing methods to each. For the non-text regions, different image processing methods are used according to the type of images contained therein.
The extraction of AIUs is important for the generation of hypermedia documents. However, for some PDF files, e.g., those that have been scanned into a computer, this can be difficult. According to an embodiment of the present invention, the method decomposes the document to determine a page layout for the underlying pages. Thus, different methods can be applied to the different portions of a page. A geometric page layout of a document is a specification of the geometry of the maximal homogeneous regions and their classification (text, table, image, drawing, etc.). Logical page layout analysis includes determining a page type, assigning functional labels such as title, note, footnote, caption, etc., to each block of the page, determining the relationships of these blocks, and ordering the text blocks according to a reading order.
OCR has had an important role in prior art systems for determining document content. Accordingly, OCR has