5 comments

  • sync 0 minutes ago
    This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR

    The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...

  • v3ss0n 52 minutes ago
    How is this better than Surya/Marker or kreuzberg (https://github.com/kreuzberg-dev/kreuzberg)?
    • jadbox 19 minutes ago
      Sounds like someone needs to run their own test cases and report back on which solution does a better job...
  • hersko 1 hour ago
    I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and would presumably be far cheaper, as I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
    • saaaaaam 43 minutes ago
      There was an interesting discussion on here a couple of months back about images vs text, driven by this article: https://www.seangoedecke.com/text-tokens-as-image-tokens/

      Discussion is here: https://news.ycombinator.com/item?id=45652952

    • trollbridge 45 minutes ago
      I always render an image and OCR that, so I don’t get odd problems from invisible text. It also avoids being affected by anything added for SEO.
    • mimim1mi 1 hour ago
      By definition, OCR means optical character recognition. Which extraction method can work depends on the contents of the PDF. Often the available PDFs are just scans of printed documents or handwritten notes. If machine-readable text is available, your approach is great.
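
      For what it's worth, here is a minimal sketch of the text-first flow with an image fallback discussed in this sub-thread. Everything in it is an assumption for illustration: the `Steps` interface, the function names, and the `MIN_TEXT_CHARS` threshold are hypothetical placeholders, not the API of pdf-parse, ocrbase, or any model provider. You would plug in your own text extractor, PDF renderer, and model client.

      ```typescript
      // Hypothetical result type; names are illustrative, not from any library.
      type Fields = Record<string, string>;
      type Extraction = { source: "text" | "image"; data: Fields };

      // Caller-supplied steps (all assumptions): e.g. pdf-parse for extractText,
      // any PDF rasterizer for renderToPng, and a model client for the other two.
      interface Steps {
        extractText(pdf: Uint8Array): Promise<string>;
        renderToPng(pdf: Uint8Array): Promise<Uint8Array>;
        // Returns null when the model could not extract data from the text.
        extractFromText(text: string): Promise<Fields | null>;
        extractFromImage(png: Uint8Array): Promise<Fields>;
      }

      // Assumed heuristic: very short text suggests a scan with no usable text layer.
      const MIN_TEXT_CHARS = 20;

      async function extractData(pdf: Uint8Array, steps: Steps): Promise<Extraction> {
        try {
          // Cheap path first: send extracted text to the model.
          const text = (await steps.extractText(pdf)).trim();
          if (text.length >= MIN_TEXT_CHARS) {
            const data = await steps.extractFromText(text);
            if (data !== null) return { source: "text", data };
          }
        } catch {
          // Text extraction or text-based parsing failed; fall through to the image path.
        }
        // Fallback: render the document to an image and extract from that.
        const png = await steps.renderToPng(pdf);
        return { source: "image", data: await steps.extractFromImage(png) };
      }
      ```

      Note that this does nothing about the invisible-text concern raised above: hidden text would still be extracted on the cheap path, so the always-render-and-OCR approach remains the safer choice for untrusted PDFs.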
  • sgc 1 hour ago
    How does this compare to dots.ocr? I got fantastic results when I tested dots.

    https://github.com/rednote-hilab/dots.ocr

    • mjrpes 57 minutes ago
      ocrbase is CUDA-only, while dots.ocr uses vLLM, so it should support ROCm/AMD cards?
  • mechazawa 2 hours ago
    Is only Bun supported, or also regular Node?