PROJECT REFERENCE

Prototype of a platform for document retrieval and advanced indexing


Customer: Haut Comité Français pour la Résilience Nationale (HCFRN)

Programme: C2IA

Supply Chain: HCFRN > CS Group SPACE

Context

DOCIA is an operational tool for information processing and retrieval. The aim is not to change how data is archived, but to make effective use of the organizational and technical means already deployed or being deployed.

CS Group's responsibilities for the prototype of a platform for document retrieval and advanced indexing are as follows:

  • Need analysis
  • Design & development

The features are as follows:

  • Full text, metadata and temporal search with highlights
  • 5 views:
    • Details: text blocks matching the search, shown with the main metadata
    • Table: the results in list format
    • Directory: the results in the document tree structure
    • Statistics: pie charts of the documents found by type, average size, etc.
    • Map: the locations found in the documents, shown on a background map
  • Shopping cart:
    • Import/export/permalinks
    • Suggested documents
  • Upload, add, update, delete files
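
As an illustration of the combined full-text, metadata and temporal search with highlights, the sketch below builds an Elasticsearch query body as a plain Python dictionary. It is not taken from the project: the field names `content`, `doc_type` and `modified` and the helper `build_search_body` are hypothetical, and the actual index mapping may differ.

```python
from datetime import date
from typing import Any, Dict, List, Optional


def build_search_body(
    text: str,
    doc_type: Optional[str] = None,
    modified_from: Optional[date] = None,
    modified_to: Optional[date] = None,
) -> Dict[str, Any]:
    """Build an Elasticsearch search body (query DSL) as a dictionary.

    Field names ("content", "doc_type", "modified") are assumptions
    made for this example, not the project's real mapping.
    """
    filters: List[Dict[str, Any]] = []

    # Metadata criterion: exact match on a keyword field.
    if doc_type is not None:
        filters.append({"term": {"doc_type": doc_type}})

    # Temporal criterion: inclusive range on a date field.
    date_range: Dict[str, str] = {}
    if modified_from is not None:
        date_range["gte"] = modified_from.isoformat()
    if modified_to is not None:
        date_range["lte"] = modified_to.isoformat()
    if date_range:
        filters.append({"range": {"modified": date_range}})

    return {
        "query": {
            "bool": {
                # Full-text criterion: scored match on the extracted text.
                "must": [{"match": {"content": text}}],
                # Filters restrict the result set without affecting the score.
                "filter": filters,
            }
        },
        # Ask Elasticsearch to return highlighted fragments of the text field.
        "highlight": {"fields": {"content": {}}},
    }


body = build_search_body("flood", doc_type="pdf", modified_from=date(2020, 1, 1))
```

The resulting dictionary can be sent as the body of a search request; keeping the metadata and date criteria in `filter` rather than `must` leaves relevance scoring to the full-text clause only.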

Project implementation

The project objectives are as follows:

  • Enable search across a large quantity of heterogeneous and unorganized documents
  • Intelligent use of data: linking and cross-referencing
  • Monitoring of local documents, websites, RSS feeds
  • Applications: Operational Mapping, Surveillance, Decision Support

The processes for carrying out the project are:

  • Agile Methodology

Technical characteristics

The solution key points are as follows:

  • Advanced indexing:
    • Identification of duplicates
    • Text extraction optimization
    • Header / footer
    • Image pre-processing
    • Metadata extraction
  • Scalable and extensible
  • Minimal use of resources
  • Log management
  • Indexing metrics by file type
  • Index sharing and export
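
One common way to identify duplicates before indexing is to group files by a hash of their content. The sketch below shows that approach; it is an assumption about how such a step could work, not a description of the project's implementation, and the helper names `file_digest` and `find_duplicates` are hypothetical. It only detects byte-identical files, not near-duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group files under root that have identical content.

    Returns one list per group of two or more byte-identical files.
    """
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("/data/documents"))` (an example path) returns the groups of identical files, so that only one copy per group needs to go through text extraction and indexing.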

The main technologies used in this project are:

Domain: Technology(ies)

Operating System(s): Linux, HTML 5 Client

Programming language(s): HTML

Interoperability (protocols, format, APIs): XMPP, WMS, WMTS, TMS, FTP, POSIX, MS SharePoint, PDF, SSO, OpenSearch, Geo/Time

Production software (IDE, DEVOPS etc.): Docker, Swagger, Git

Main COTS library(ies): ElasticSearch, PyTorch, Spark