MarkLogic 4.1 Released

MarkLogic have released version 4.1, not long after 4.0 came out. I haven’t had a chance to use it yet but it looks like there are a few cool new features:

  • XML schema validation
  • Japanese language support
  • Task scheduler
  • JSON support
  • HTTP app servers support REST, URL rewriting and HTTPS
  • MarkLogic Application Services – includes a new search API that seems to
    be a move to provide built-in functionality like lib-search, and a GUI
    tool to develop demos.

The JSON support would have been really valuable in the project I just finished, and I can see how useful a task scheduler would be for maintenance jobs etc. The improvements to the HTTP app server don’t really interest me. Why would you want to use MarkLogic as an HTTP server in any real production environment? Despite these improvements, it’s woefully under-featured compared to Apache or IIS, and is an expensive MarkLogic cluster really the right place to be handling HTTP requests?

And as for the application builder GUI… I haven’t checked it out yet, but the idea scares me to death. I can just imagine people knocking something up that nearly works and then thinking that creating a proper, scalable, production-ready application should be just as quick and easy.

Now I just need to find a way to justify upgrading our cluster – I doubt it will happen any time soon, unfortunately.

MarkLogic improvements

I was recently asked to write a list of improvements I’d like to see in future MarkLogic releases – here’s what I came up with:

Server

  • Error messages are sometimes misleading and unhelpful
  • Logging restricted to global errorLog.txt
  • Can’t index an element at a specific XPath (can only specify QName which is often not unique)
  • Registered queries are too much effort to use – they don’t persist after restart and have to be called by id
  • Stemmed and unstemmed queries search different sets of content, because stemmed searches are single-language
  • XML update functions (eg xdmp:node-replace) don’t work on in-memory nodes
  • No built-in XSLT engine
  • Ease of scalability by adding ‘commodity hardware’ undermined by licensing model per CPU


Admin site

  • Hard to navigate – tree structure confusing
  • Limited help – just loads of text fields with no hints to valid settings
  • Times out when the server has a problem.
  • API docs aren’t searchable (why not index in MarkLogic?!)


Developer Tools (e.g. Record Loader, CQ, Corb, lib-search)

  • Suffer from not being official products – very limited support, erratic development
  • In general they are not robust, are poorly designed, hard to set up and use (e.g. CQ UI is very basic, record loader leaks memory)
  • All poorly documented
  • Common development requirements not catered for. e.g. import/export server configuration
  • Mailing list is the only help available – would be nice to have forums and online tutorials
  • XCC – .Net version is just a wrapper around Java so incurs performance penalty

I should point out that I really like working with MarkLogic, and it’s easy to take for granted what a powerful piece of software it is. Most of my gripes concern the tools and documentation that support it. I feel they’ve been neglected so far which is a shame but I guess understandable for a new and quickly evolving product.

MarkLogic searches – stemmed vs unstemmed

It turns out that setting a cts:query to be stemmed or unstemmed has a nasty side-effect that can be a real problem if you are searching content in multiple languages. In theory a stemmed search should always return the same or more results than an unstemmed query, right? Well that’s true, as long as both queries are searching the same content…

And here’s the problem: In MarkLogic, stemmed queries only search content in one language.

cts:search(
    fn:doc(),
    cts:word-query('chicken', 'unstemmed')
)

This will search for the word ‘chicken’ in all documents, regardless of language.

cts:search(
    fn:doc(),
    cts:word-query('chicken', 'stemmed')
)

This will search for stems of the word ‘chicken’ in documents of the database default language only.

I was shocked by this: it turns out all my searches were omitting the 10% or so of documents not in English! I had wrongly presumed that all content would be searched, with stemming only having an effect on English content.

So what do you do if you need to use stemming but also want to search content across multiple languages? There is no fantastic solution but here are 2 hacks that get the job done:

1. Set all xml:lang attributes everywhere to the same language, and store the actual language somewhere else instead. This is nice because it doesn’t require changing any xquery, but may introduce complications further down the line.

2. Modify all stemmed queries to an or query – stemmed or unstemmed.

cts:search(
  fn:doc(),
  cts:or-query((
    cts:word-query('chicken', 'stemmed'),
    cts:word-query('chicken', 'unstemmed')
  ))
)

This searches en documents with stemming, and then all documents without stemming. It will incur a performance hit, as the query is more complex. How bad will depend on your content, index settings etc. On my database the impact was negligible.
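
If you go with option 2, it may be easier to wrap the pattern in one helper than to edit every query by hand. The following is only a sketch: it assumes an app server running in XQuery 1.0 mode ("1.0-ml"), and local:stemmed-or-exact is a made-up name for illustration, not a MarkLogic function.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: matches stems of $text in the default language,
   or the exact word in documents of any language. :)
declare function local:stemmed-or-exact($text as xs:string) as cts:query
{
  (: cts:or-query takes its sub-queries as a single sequence :)
  cts:or-query((
    cts:word-query($text, "stemmed"),
    cts:word-query($text, "unstemmed")
  ))
};

cts:search(fn:doc(), local:stemmed-or-exact("chicken"))
```

Keeping the or-query in one place also means that if the performance hit turns out to matter, there is only one function to change.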

Which option is better I’m not sure – neither is ideal. Here is a discussion I had about this issue on the MarkLogic mailing list:
http://markmail.org/thread/e5hdnjg5rcayxxhl

xdmp:strftime() requires year >= 1900

There’s a weird bug in the xdmp:strftime() function in MarkLogic, I think inherited from elsewhere, that causes an error when supplied a year less than 1900.

xdmp:strftime('%Y',
        xs:dateTime('1890-01-01T00:00:00'))

SVC-STRFTIMEYEAR: xdmp:strftime("%Y", xs:dateTime("1890-01-01T00:00:00")) -- Year cannot be formatted: 1890

As a work-around call this function in a custom namespace:

define function strftime($format as xs:string, $dt as xs:dateTime)
as xs:string
{
  fn:replace(
    xdmp:strftime(fn:replace($format, '%Y', '#Y'), $dt),
    '#Y',
    fn:string(fn:year-from-date(xs:date($dt)))
  )
}
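
For app servers running in XQuery 1.0 mode, the same work-around might look roughly like the sketch below. The my: prefix and its namespace URI are placeholders I made up, and I am assuming the bug still applies in that mode.

```xquery
xquery version "1.0-ml";

(: Placeholder namespace for the wrapper; use whatever suits your project. :)
declare namespace my = "http://example.com/my-utils";

declare function my:strftime($format as xs:string, $dt as xs:dateTime)
as xs:string
{
  (: Hide %Y from xdmp:strftime, then put the year back in afterwards. :)
  fn:replace(
    xdmp:strftime(fn:replace($format, "%Y", "#Y"), $dt),
    "#Y",
    fn:string(fn:year-from-dateTime($dt))
  )
};

my:strftime("%d/%m/%Y", xs:dateTime("1890-01-01T00:00:00"))
```

The '#Y' marker is arbitrary. If your format strings can contain a literal '#Y', pick a different marker.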

AuthorMapper.com

My first big project with MarkLogic went live about a month back: AuthorMapper.

We loaded all Springer journal content (3m+ articles) into MarkLogic and used lib-search as the basis for a faceted search interface. Using Google’s geocoding service, we store the longitude/latitude coordinates of the authors, which lets us plot the results on a Google map. The JavaScript framework Ext JS is used for the search form and some other nice UI tweaks, like resizing the map.

More on MarkLogic 4

Just remembered another feature that I’d really like to see in MarkLogic: the admin interface currently shows the size of a database in MB. It would be useful to break that into data size and index sizes. If I add an index I’d like to know how much space it takes up, and also whether it is fully loaded into RAM or not.

On the topic of indexes, it’s hard to nail down which functions (e.g. cts:element-query) make use of which indexes. Ideally there would be a more consistent naming convention for indexes and functions to make it more obvious, but an easier solution would be to simply add information to the function reference – so for each function, some detail on which indexes will be used if available.

MarkLogic 4 – Initial thoughts

Last week I upgraded our development server to MarkLogic 4.0.1. The actual installation process was pretty simple – uninstall 3, install 4. Thankfully it kept all the databases, app servers and configuration. The only delay was reindexing the content, which took about 4 days for a 160 GB database (3.3m fragments).

ML 4 now supports XQuery 1.0, but automatically sets existing app servers to use 0.9 to maintain compatibility. There were still a number of small code changes I had to make to get our apps running, seemingly because even in 0.9 mode it is still stricter with parsing/executing xquery code:

  • I had an xquery file which imported a module which I wasn’t actually using, and it had a syntax error in it. In 3.2 this didn’t cause any problems but in 4.0 it threw an exception. Makes sense, and it’s my fault I know, but it’s a difference nonetheless.
  • Another issue arguably caused by my slack programming… When specifying the return type of a function, if the function actually returns a different type ML 4.0 now enforces a conversion to the type specified. e.g.

define function getTitle() as xs:string  {
    <title>the <b>title</b></title>
}

<x>{getTitle()}</x>

returns..

ML 3.2.8: <x><title>the <b>title</b></title></x>
ML 4.0.1: <x>the title</x>

No argument, ML 4 has it right, but it did break my app and took a while to figure out.
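
To get the same result on both versions, one option is to make the intent explicit: declare the element type when you want the markup, and call fn:string() yourself when you want the text. This is a sketch in XQuery 1.0 syntax, and the function names are just for illustration.

```xquery
xquery version "1.0-ml";

(: Returns the element itself, so the inline markup is preserved. :)
declare function local:title-element() as element(title)
{
  <title>the <b>title</b></title>
};

(: Returns only the text; the conversion is explicit rather than
   relying on the declared return type. :)
declare function local:title-text() as xs:string
{
  fn:string(local:title-element())
};

(
  <x>{local:title-element()}</x>,  (: <x><title>the <b>title</b></title></x> :)
  <x>{local:title-text()}</x>      (: <x>the title</x> :)
)
```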

ML 4 has some interesting new features – XQuery 1.0, the geospatial stuff looks cool, and the forest level failover is something we’re going to use straight away. I was a bit disappointed they didn’t take the chance to address some other areas:

  • Still no integrated XSLT engine. Come on guys, just admit XQuery isn’t great at everything!
  • Indexes are still specified by QName. It would be great to specify a path, so I can put an index on /article/author without it also indexing /article/reference/author
  • Registered queries are still too much of a pain to bother with. I mean – you have to call them by an id number and they don’t persist after a restart, so every time you call it you have to check it exists… pah.
  • Error messages. Still awful.

XQuery support in Notepad++

UPDATE: A new version for XQuery 1.0 and MarkLogic 4.1 is available here:
XQuery support in Notepad++ XQuery 1.0 and MarkLogic 4.1 update

I’ve been struggling to find the best development setup for working with MarkLogic and XQuery. None of the standard text editors or IDEs I tried have support for the XQuery language. I prefer to use lightweight editors rather than resource hogs like Visual Studio, XML Spy, Eclipse etc., so I set about adding XQuery support to my current favourite, Notepad++, which handily supports defining custom languages.

It adds syntax highlighting and auto-complete of W3C and MarkLogic xquery functions. It’s based on ML 3.2, and I’ll update it at some point for 4.0 if/when we upgrade and I need to do it…

There’s a readme in the zip to show where to copy the files.

Notepad++XQY.zip (for XQuery 0.9/MarkLogic 3)

Welcome!

So welcome to my new blog where I’ll be posting about interesting(?) things that I come across during my adventures in web development.

I’m currently working with a few different technologies, mainly:
