با برنامه Player FM !
Episode 104. It's all about Apache Tika, the project that lets you index EVERYTHING.
Manage episode 413288553 series 2852088
So we continue to have guests in our show to talk to us about interesting things... This time is about Apache Tika. This is an incredible tool to do search file processing and metadata extraction. Think about that you have tons of unstructured files, like emails, or documents, and you want to extract, index and then search theses. This is Tika's purpose. And who best to walk us through how it does its magic that its Project Management Committee (PMC) Chair, Tim Allison!
So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).
http://www.javapubhouse.com/datadog We thank DataDogHQ for sponsoring this podcast episode
Don't forget to SUBSCRIBE to our cool NewsCast OffHeap! http://www.javaoffheap.com/
Apache Tika * https://tika.apache.org/
OpenSearch Project and OpenSearch Neural Plugin Tutorials * https://opensearch.org/ * https://opensearch.org/docs/latest/search-plugins/neural-search/ * https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/ * https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/ * https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html * https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html
Selected Advanced File Processing toolkits/services * https://unstructured.io/ * https://aws.amazon.com/textract/ * https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence
Selected Hybrid Search/RAG toolkits (there are _MANY_ others!) * Haystack: https://haystack.deepset.ai/ * LangChain: https://www.langchain.com/ * LangStream: https://langstream.ai/
Search/Relevance Conferences * https://haystackconf.com/ * https://2024.berlinbuzzwords.de/ * https://mices.co/
Tim's personal project * JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2
Do you like the episodes? Want more? Help us out! Buy us a beer! https://www.javapubhouse.com/beer
And Follow us! https://www.twitter.com/javapubhouse
107 قسمت
Manage episode 413288553 series 2852088
So we continue to have guests in our show to talk to us about interesting things... This time is about Apache Tika. This is an incredible tool to do search file processing and metadata extraction. Think about that you have tons of unstructured files, like emails, or documents, and you want to extract, index and then search theses. This is Tika's purpose. And who best to walk us through how it does its magic that its Project Management Committee (PMC) Chair, Tim Allison!
So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).
http://www.javapubhouse.com/datadog We thank DataDogHQ for sponsoring this podcast episode
Don't forget to SUBSCRIBE to our cool NewsCast OffHeap! http://www.javaoffheap.com/
Apache Tika * https://tika.apache.org/
OpenSearch Project and OpenSearch Neural Plugin Tutorials * https://opensearch.org/ * https://opensearch.org/docs/latest/search-plugins/neural-search/ * https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/ * https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/ * https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html * https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html
Selected Advanced File Processing toolkits/services * https://unstructured.io/ * https://aws.amazon.com/textract/ * https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence
Selected Hybrid Search/RAG toolkits (there are _MANY_ others!) * Haystack: https://haystack.deepset.ai/ * LangChain: https://www.langchain.com/ * LangStream: https://langstream.ai/
Search/Relevance Conferences * https://haystackconf.com/ * https://2024.berlinbuzzwords.de/ * https://mices.co/
Tim's personal project * JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2
Do you like the episodes? Want more? Help us out! Buy us a beer! https://www.javapubhouse.com/beer
And Follow us! https://www.twitter.com/javapubhouse
107 قسمت
همه قسمت ها
×به Player FM خوش آمدید!
Player FM در سراسر وب را برای یافتن پادکست های با کیفیت اسکن می کند تا همین الان لذت ببرید. این بهترین برنامه ی پادکست است که در اندروید، آیفون و وب کار می کند. ثبت نام کنید تا اشتراک های شما در بین دستگاه های مختلف همگام سازی شود.